Title: | Complement to 'Modern Data Science with R' |
---|---|
Description: | A complement to all editions of *Modern Data Science with R* (ISBN: 978-0367191498, publisher URL: <https://www.routledge.com/Modern-Data-Science-with-R/Baumer-Kaplan-Horton/p/book/9780367191498>). This package contains data and code to complete exercises and reproduce examples from the text. It also facilitates connections to the SQL database server used in the book. All editions of the book are supported by this package. |
Authors: | Benjamin S. Baumer [aut, cre] , Nicholas Horton [aut] , Daniel Kaplan [aut] |
Maintainer: | Benjamin S. Baumer <[email protected]> |
License: | CC0 |
Version: | 0.2.8 |
Built: | 2024-11-17 05:42:19 UTC |
Source: | https://github.com/mdsr-book/mdsr |
Cherry Blossom runs
Cherry
An object of class tibble::tbl_df with 41,248 rows and 8 columns. Each row refers to an individual runner in one race of the Cherry Blossom Ten Miler. The data cover the years 1999 to 2008. All of the runners listed ran at least two of the races in that period; some ran many more than that.
a unique identifier for each runner composed of the runner's full name and year of birth.
integer giving the runner's age in the race whose result is being reported.
the number of minutes elapsed from the starter's gun to the person crossing the finish line
the number of minutes elapsed from the runner's crossing the start line to crossing the finish line.
the runner's sex
the year of that race
integer specifying how many times previous to this race the runner had participated in the years 1999 to 2008.
integer giving the total number of times that runner participated in the years from 1999 to 2008. The smallest is 2, the largest is 10.
The Cherry Blossom 10 Mile Run is a road race held in Washington, D.C. in April each year. (The name comes from the famous cherry trees that are in bloom in April in Washington.) The results of this race are published at https://www.cherryblossom.org/post-race/race-results/.
https://www.cherryblossom.org/post-race/race-results/.
Data Science in R, Nolan and Temple Lang (ISBN 978-1482234817), Ch. 2
if (require(dplyr)) {
  Cherry |>
    group_by(name.yob) |>
    count() |>
    group_by(n) |>
    count(name = "appearances")
}
Deaths and Pumps from 1854 London cholera outbreak
CholeraDeaths CholeraPumps
An object of class sf::sf() whose data attribute has 250 rows and 2 columns.
An object of class sf::sf.
Both spatial objects are projected in EPSG:27700, aka the British National Grid.
https://blog.rtwilson.com/john-snows-cholera-data-in-more-formats/
if (require(sf)) {
  plot(st_geometry(CholeraDeaths))
}
The CIA Factbook has geographic, demographic, and economic data on a country-by-country basis. In the description of the variables, the 4-digit number indicates the code used to specify that variable on the data and documentation web site.
CIACountries
A data frame with the following variables for each of the Countries in the World. (236 countries are given.)
Name of the country
number of people, 2119
area (sq km), 2147
Crude oil - production (bbl/day), 2241
Gross Domestic Product per capita ($/person), 2001
education spending (% of GDP), 2206
Roadways per unit area (km/sq km), 2085
Fraction of Internet users (% of population), 2153
From the CIA World Factbook, https://www.cia.gov/the-world-factbook/
https://github.com/factbook/factbook/blob/master/CATEGORIES.md
str(CIACountries)
Papers matching the search string "Data Science" on arXiv.org in August, 2020
DataSciencePapers
A data frame with 1089 observations on the following 15 variables.
unique arXiv.org identifier for the paper
date submitted
date last updated
title of the paper
contents of the abstract
authors of the paper
affiliations of the authors
direct link to the abstract
direct link to the pdf
direct link to the digital object identifier (doi)
commentary
reference to the journal (if published)
digital object identifier
arXiv.org primary category
arXiv.org categories
data(DataSciencePapers)
str(DataSciencePapers)
Election Statistics from the 2013 Minneapolis Mayoral Election
Elections
An object of class tibble::tbl_df with 117 rows and 13 columns.
Number of the ward
Number of the precinct
Number of registered voters as of 7 am
Number of voters registering at the polls
Number of voters registering by absentee
Total number of registered voters
Number of voters at the polls
Number of absentee voters
Number of total ballots cast
Total number of voters turning out
Percentage of absentee voters
Percentage of voters relative to total number of people
Number of spoiled ballots
https://vote.minneapolismn.gov/results-data/election-results/2013/mayor/
The training dataset includes a set of email subject lines used for classification of whether the message is spam (unsolicited commercial content) or not. Many subject lines include subject matter inappropriate for classroom use. Given the volume of subject lines containing such language (especially for spam == TRUE), user discretion is advised.
This dataset is a random sample of 80% of the emails data.
The testing dataset is a random sample of 20% of the emails data.
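The 80/20 split described above can be sketched in base R as follows. This is only an illustration: the data frame `emails`, its column names, and the seed are hypothetical stand-ins, and the procedure actually used to build Emails_train and Emails_test is not documented here.

```r
# Hypothetical stand-in for the full emails data (not the package's data)
emails <- data.frame(
  id = 1:10,
  subjectline = paste("subject", 1:10),
  spam = rep(c(TRUE, FALSE), times = 5)
)

set.seed(2020)  # arbitrary seed, assumed only for reproducibility

# Draw 80% of the row indices at random for the training set
n_train <- floor(0.8 * nrow(emails))
train_idx <- sample(nrow(emails), size = n_train)

emails_train <- emails[train_idx, ]
emails_test <- emails[-train_idx, ]  # the remaining 20%

nrow(emails_train)
nrow(emails_test)
```

Because the test rows are defined as everything not drawn for training, no message can appear in both sets.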
Emails_train Emails_test
A data frame with 5,526 rows and 3 variables:
an integer vector
a character vector
a character vector
A data frame with 1,382 rows and 3 variables:
Originally retrieved from https://www.stat.berkeley.edu/~nolan/data/spam/SpamAssassinMessages.zip
Data Science in R, Nolan and Temple Lang (ISBN 978-1482234817), Ch. 3
nrow(Emails_train)
nrow(Emails_test)
Load the NCI60 data from GitHub
etl_NCI60()
# The file is 5.0 MB
NCI60 <- etl_NCI60()
This data comes from Chakraborty et al., which combines headlines from a variety of news and clickbait sources. Some headlines contain subject matter inappropriate for classroom use. Given the volume of headlines containing such language (especially for clickbait == TRUE), this filtering might not catch all problematic headlines. User discretion is advised.
The training dataset is a random sample of approximately 80% of the observations from the original dataset.
The testing dataset is a random sample of the remaining 20% of the observations not found in the training set.
Headlines_train Headlines_test
A data frame with 18,360 rows and 3 variables:
a character vector
a logical vector
an integer vector
A data frame with 4,589 rows and 3 variables:
https://github.com/bhargaviparanjape/clickbait/
doi:10.1109/ASONAM.2016.7752207
nrow(Headlines_train)
nrow(Headlines_test)
The entire text of Macbeth, stored in a character vector of length 1.
Macbeth_raw
A character vector of length 1
Project Gutenberg, https://www.gutenberg.org/ebooks/1129/
Wrangle babynames data
make_babynames_dist()
a tibble::tbl_df similar to babynames::babynames with a column for the estimated number of people alive in 2014.
BabynameDist <- make_babynames_dist()
if (require(dplyr)) {
  BabynameDist |>
    filter(name == "Benjamin")
}
Custom table output
mdsr_table(x, ...)
mdsr_sql_explain_table(x, ...)
mdsr_sql_keys_table(x, ...)
x |
A data.frame |
... |
arguments passed to |
mdsr_table(faithful)
These data for 2011, released in May 2013, describe how much hospitals charged Medicare for various inpatient procedures, how many were performed, and how much Medicare actually paid.
MedicareCharges
A data frame with 5,025 observations on the following 4 variables.
Code for the Diagnosis Related Group: a character string that looks like a number.
the state providing the care.
the total number of charges.
the average charge for each drg across each state
These data are part of a set with DiagnosisRelatedGroup, which gives a description of the medical procedure associated with each DRG, and MedicareProviders, which translates idProvider into a name, address, state, ZIP code, etc.
These data have been pre-aggregated by state.
Data from the Centers for Medicare and Medicaid Services. See https://data.cms.gov/provider-summary-by-type-of-service/medicare-inpatient-hospitals/
data(MedicareCharges)
Name and location data for the Medicare providers in the MedicareCharges data table.
MedicareProviders
A data frame with 3337 observations on the following 7 variables.
a unique number assigned to each provider
Name of the provider. (text string)
Street address of the provider. (text string)
The name of the city in which the provider is located. (factor)
The two-letter postal code of the state in which the provider is located. (factor)
The provider's ZIP code. (factor)
An identifier for the region serviced by the provider.
This data table is related to MedicareCharges data.
Extracted from the highly repetitive table provided by the Centers for Medicare and Medicaid Services. See https://data.cms.gov/provider-summary-by-type-of-service/medicare-inpatient-hospitals/
data(MedicareProviders)
The choices marked on each (valid) ballot for the election, which was run using a rank-choice, instant runoff system.
Minneapolis2013
A data frame with 80,101 observations on the following 5 variables. All are stored as character strings.
Precincts are sub-divisions within Wards
The voter's first choice
The voter's second choice
The voter's third choice
The city is divided spatially into districts or 'wards'. These are further subdivided into precincts.
Ballot information for the 2013 Minneapolis Mayoral election, which was run as a rank-choice election. In rank-choice, a voter can indicate first, second, and third choices. If a voter's first choice is eliminated (by being last in the count across voters), the second choice is promoted to that voter's first choice, and similarly third -> second. Eliminations are done successively until one candidate has a majority of the first-choice votes.
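The elimination procedure described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the city's official tallying method: the `ballots` data frame and the `instant_runoff()` helper are hypothetical, ties for last place are broken arbitrarily, and real ballots may contain blank or invalid entries that are not handled here.

```r
# Hypothetical ballots: one row per voter, columns are the ranked choices
ballots <- data.frame(
  First  = c("A", "A", "A", "B", "B", "C", "C", "C", "C"),
  Second = c("B", "B", "C", "A", "A", "B", "B", "A", "B"),
  Third  = c("C", "C", "B", "C", "C", "A", "A", "B", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns the name of the winning candidate
instant_runoff <- function(ballots) {
  prefs <- as.matrix(ballots)
  remaining <- unique(as.vector(prefs))
  repeat {
    # Each ballot counts for its highest-ranked candidate still in the race
    top <- apply(prefs, 1, function(b) {
      alive <- b[b %in% remaining]
      if (length(alive) == 0) NA_character_ else alive[1]
    })
    tally <- table(factor(top, levels = remaining))
    # Stop once a candidate holds a majority of the ballots still counting
    if (max(tally) > sum(tally) / 2 || length(remaining) == 1) {
      return(names(tally)[which.max(tally)])
    }
    # Otherwise eliminate the candidate with the fewest votes
    # (ties are broken by position; real rules are more detailed)
    remaining <- setdiff(remaining, names(tally)[which.min(tally)])
  }
}

instant_runoff(ballots)
```

With these toy ballots no candidate has a first-round majority, so B (the fewest first choices) is eliminated and B's two ballots transfer to their second choice, A, who then wins 5 to 4.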
Ballot data from the Minneapolis city government: https://vote.minneapolismn.gov/results-data/election-results/2013/mayor/
Description of ranked-choice voting: https://vote.minneapolismn.gov/ranked-choice-voting/
A Minnesota Public Radio story about the election ballot tallying process: https://www.mprnews.org/2013/11/22/politics/ranked-choice-vote-count-programmers/
The Wikipedia article about the election: https://en.wikipedia.org/wiki/2013_Minneapolis_mayoral_election
data(Minneapolis2013)
A dataset containing information about Major League Baseball teams from 2008-2014.
MLB_teams
A tibble::tbl_df object.
season in which the team played
the team's three character identifier
the league in which the team played
number of wins
number of losses
winning percentage
number of fans in attendance
number of fans in attendance, relative to the team with the highest attendance in this sample (the 2008 New York Yankees)
the sum of the salaries of the players on each team. Note that this number is only an estimate of the actual team payroll – and may not even be a very good one. Salaries are accumulated from Lahman::Salaries
the size of the team's home city's metropolitan population, according to Wikipedia and the 2010 US Census
the full name of the team
The Lahman::Teams table from Lahman::Lahman-package and https://en.wikipedia.org/wiki/List_of_Metropolitan_Statistical_Areas
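The relative attendance variable described above divides each team's attendance by the largest attendance in the sample. A minimal sketch, assuming a hypothetical data frame `teams` with made-up attendance figures and hypothetical column names:

```r
# Hypothetical teams and made-up attendance figures (not the package's data)
teams <- data.frame(
  team = c("AAA", "BBB", "CCC"),
  attendance = c(4000000, 3000000, 1800000)
)

# Scale attendance so the best-attended team has a value of 1
teams$rel_attendance <- teams$attendance / max(teams$attendance)
teams
```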
The data come from a National Cancer Institute study of gene expression in cell lines drawn from various sorts of cancer.
NCI60_tiny Cancer
The expression data, NCI60_tiny, is a data frame of 41,078 gene probes (rows) and 60 cell lines (columns). The first column, Probe, gives the name of the Agilent microarray probe. Each of the remaining columns is named for a cell line. The value is the log-2 expression associated with that probe for the cell line.
the name of the Agilent microarray probe
For Cancer:
a character vector giving the name of one cell line
a character vector giving the name of another cell line
the correlation between the two cell lines. See stats::cor()
An object of class tbl_df (inherits from tbl, data.frame) with 1770 rows and 3 columns.
Cancer gives information about each cell line.
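The 1770 rows are consistent with one row per unordered pair of 60 cell lines, since choose(60, 2) is 1770. A table of this shape can be sketched from an expression matrix with stats::cor(). The sketch below uses simulated values and hypothetical cell line and column names; it is not the code used to build the package's data.

```r
set.seed(1)  # arbitrary seed for the simulated values

# Hypothetical expression matrix: 100 probes (rows) by 4 cell lines (columns)
expr <- matrix(
  rnorm(400), nrow = 100,
  dimnames = list(NULL, c("lineA", "lineB", "lineC", "lineD"))
)

cors <- cor(expr)  # 4 x 4 correlation matrix between cell lines

# Keep each unordered pair once by taking the upper triangle
idx <- which(upper.tri(cors), arr.ind = TRUE)

pairs <- data.frame(
  cellLine = rownames(cors)[idx[, "row"]],
  otherCellLine = colnames(cors)[idx[, "col"]],
  correlation = cors[idx]
)

nrow(pairs) == choose(ncol(expr), 2)
```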
Staunton et al. (2001), PNAS (doi:10.1073/pnas.191368598)
D.T. Ross et al. (2000) Nature Genetics, 24(3):227-234 (doi:10.1038/73432)
data(NCI60_tiny)
The historical record of birds captured and released at the Katharine Ordway Natural History Study Area, a 278-acre preserve in Inver Grove Heights, Minnesota, owned and managed by Macalester College.
ordway_birds
A data frame with 15,829 observations on the bird's species, size, date found, and band number.
a character vector
Timestamp indicates when the data were entered into an electronic record, not anything about the bird being described
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
a character vector
Timestamp indicates when the data were entered into an electronic record, not anything about the bird being described.
There are many extraneous levels of variables such as species. Part of the purpose of this data set is to teach about data cleaning.
Jerald Dosch, Dept. of Biology, Macalester College: the manager of the Study Area.
https://www.macalester.edu/ordway/
ordway_birds
Convert Rnw to Rmd
Rnw2Rmd(path, new_path = NULL)
path |
A character vector of one or more paths. |
new_path |
New file path. If Should either be the same length as |
Saratoga Houses
saratoga_houses saratoga_codes
A tibble with 1728 rows and 16 variables.
@examples saratoga_houses
An object of class spec_tbl_df (inherits from tbl_df, tbl, data.frame) with 13 rows and 3 columns.
SAT results by state for 2010
SAT_2010
A data.frame with 50 rows and 9 variables.
a factor with levels for each state
average expenditure per student (in each state)
pupil to teacher ratio in that state
teacher salary (in 2010 US $)
state average Reading SAT score
state average Math SAT score
state average Writing SAT score
state average Total SAT score
percent of students taking SAT in that state
See also the earlier mosaicData::SAT dataset.
Embedded webshot of leaflet map
save_webshot( map, path_to_img, overwrite = FALSE, vwidth = 800, vheight = 600, cliprect = "viewport", ... )
map |
A leaflet map object |
path_to_img |
A path to the image file to save |
overwrite |
Do you want to clobber any existing file? |
vwidth |
Viewport width. This is the width of the browser "window". |
vheight |
Viewport height. This is the height of the browser "window". |
cliprect |
Clipping rectangle. If |
... |
arguments passed to |
a path to a PNG file
## Not run:
if (require(leaflet)) {
  map <- leaflet() |>
    addTiles() |>
    addMarkers(lng = 174.768, lat = -36.852, popup = "The birthplace of R")
  save_webshot(map, tempfile())
}
## End(Not run)
Custom skimmer
skim(data, ...)
data |
A tibble, or an object that can be coerced into a tibble. |
... |
Columns to select for skimming. When none are provided, the default is to skim all columns. |
skim(faithful)
Connect to the scidb server on Amazon Web Services.
src_scidb(dbname, ...)
dbConnect_scidb(dbname, ...)
mysql_scidb(dbname, ...)
dbname |
the name of the database to which you want to connect |
... |
arguments passed to |
This is a public, read-only account. Any abuse will be considered a hostile act.
The MariaDB server accessible via these functions is a db.t3.micro RDS instance hosted by Amazon Web Services. It is NOT a powerful server, having only 2 CPUs, 1 GB of RAM, and 20 GB of disk space. It is useful for quick, efficient and no-stress setup, but not useful for any kind of serious computing.
The airlines database on the server contains complete flight records for the three years from 2013 to 2015, with about 6 million rows per year. Thus, the flights table contains approximately 18 million rows. The flights table has several indexes, including indexes on year, origin, dest, carrier, and tailnum. There is also a composite index on the date (across year, month, and day). Please use these indexes to improve query response times.
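As a sketch of what using the indexes can look like, the query below filters on the columns of the composite date index and on origin before aggregating. The specific date, the airport code 'JFK', and the output column name num_flights are assumptions chosen for illustration. The connection calls are commented out because they need network access plus the mdsr and DBI packages.

```r
# Filter on the composite date index (year, month, day) and on origin,
# so the server does not need to scan the whole flights table.
query <- paste(
  "SELECT carrier, COUNT(*) AS num_flights",
  "FROM flights",
  "WHERE year = 2013 AND month = 6 AND day = 26",
  "AND origin = 'JFK'",  # assumed airport code, for illustration only
  "GROUP BY carrier"
)

# Running the query requires network access:
# db_air <- mdsr::dbConnect_scidb("airlines")
# DBI::dbGetQuery(db_air, query)
# Prefixing with EXPLAIN asks the server to report its query plan:
# DBI::dbGetQuery(db_air, paste("EXPLAIN", query))

cat(query, "\n")
```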
There are two databases on this server:
airlines: The structure of the database is similar to what you find in the nycflights13 and nycflights23 packages. See their documentation at nycflights13::flights and nycflights23::airports, for example.
imdb: These data were retrieved from an old dump of the Internet Movie Database, circa 2016. Please see this ER diagram for relationships between the tables.
For src_scidb(), a dbplyr::src_dbi object
For dbConnect_scidb(), a RMariaDB::MariaDBConnection object
For mysql_scidb(), a character vector of length 1 to be used as an engine.opts argument, or on the command line.
dbplyr::src_dbi(), nycflights13::flights, nycflights23::airlines
# Connect to the database instance via `dplyr`
db_air <- src_scidb("airlines")
db_air

# Connect to the database instance via `DBI` (recommended)
db_air <- dbConnect_scidb("airlines")
db_air

# Get more information...
if (require(DBI)) {
  # About the database instance
  dbGetInfo(db_air)
  # About the available tables
  dbListTables(db_air)
  # About the variables in a particular table
  dbListFields(db_air, "flights")
  # About the indexes (using raw SQL)
  dbGetQuery(db_air, "SHOW KEYS FROM flights")
}

if (require(knitr)) {
  opts_chunk$set(engine.opts = mysql_scidb("airlines"))
}
Graphical themes used in MDSR book
theme_mdsr(base_size = 12, base_family = "Bookman")
base_size |
base font size, given in pts. |
base_family |
base font family |
if (require(ggplot2)) {
  p <- ggplot(mtcars, aes(x = hp, y = mpg, color = factor(cyl))) +
    geom_point() +
    facet_wrap(~ am) +
    geom_smooth()
  p + theme_grey()
  p + theme_mdsr()
}
NYC Restaurant Health Violations
Violations ViolationCodes Cuisines
A data frame with 480,621 observations on the following 16 variables.
unique identifier
full name doing business as
borough of New York
building name
street address
zipcode
phone number
inspection date
action taken
violation code, see ViolationCodes
inspection score
inspection grade
grade date
recording date
inspect type
cuisine code, see Cuisines
A data frame with 174 observations on the following 3 variables.
a factor with many levels
is violation critical: a factor with levels N, Y
violation description
A data frame with 84 observations on the following 2 variables.
a character vector
a character vector
data(Violations)
if (require(dplyr)) {
  Violations |>
    inner_join(Cuisines, by = "cuisine_code") |>
    filter(cuisine_description == "American") |>
    arrange(grade_date) |>
    head()
}
Votes recorded on each ballot by each member of the Scottish Parliament in 2008 along with information about party affiliation.
Votes Parties
Votes is a data.frame with 103582 rows and 3 variables.
an identifier for the bill
the name of the member of parliament
1 means a vote for, -1 a vote against. 0 is an abstention.
Parties is a data.frame with 134 rows, one for each member of parliament, and 2 variables.
the name of the political party the member belongs to
the name of the member of parliament
An object of class data.frame with 134 rows and 2 columns.
Almost all of the members of parliament belong to a political party. This table identifies that party. These data were provided by Caroline Ettinger and form part of her senior honors project at Macalester College. Prof. Andrew Beveridge supervised the thesis. Ms. Ettinger used the vote data to explore how to extract the party association of members purely from voting records. The Parties data was used to evaluate the success of methods.
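A common first step when comparing members by their voting records is to reshape the long vote table into a member-by-bill matrix. The sketch below is only an illustration with made-up data: the `votes` data frame, its column names, the bill identifiers, and the member names are all hypothetical, and the use of Euclidean distance is an assumption rather than the method used in the thesis.

```r
# Hypothetical long-format votes: 1 = for, -1 = against, 0 = abstention
votes <- data.frame(
  bill = rep(c("bill_1", "bill_2", "bill_3"), each = 3),
  name = rep(c("Member A", "Member B", "Member C"), times = 3),
  vote = c(1, 1, -1,  -1, -1, 1,  1, 0, -1)
)

# One row per member, one column per bill
vote_matrix <- with(votes, tapply(vote, list(name, bill), sum))
vote_matrix

# Members who vote alike are separated by small distances
dist(vote_matrix)
```

In this toy example Member A and Member B differ on a single bill, so they sit much closer to each other than either does to Member C.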
A list of cities
world_cities
A data frame with 4,428 observations on the following 10 variables.
integer id of record in geonames database
name of geographical point in plain ascii characters
latitude in decimal degrees (wgs84)
longitude in decimal degrees (wgs84)
ISO-3166 2-letter country code
fipscode
Population
the iana timezone id
date of last modification
GeoNames: http://download.geonames.org/export/dump/
world_cities