Today’s wrangling exercises are modified from Master the Tidyverse.
The dplyr
package is where the most common data wrangling “verbs” come from. The main ones are:
select()
to pick out variables/columnsfilter()
to pick out cases/rowsarrange()
, which arranges casessummarize()
, a many-to-few or many-to-one operation to suummarizemutate()
, a many-to-many operation to add additional variables.group_by()
, a helper to create grouped data framesWe’ll focus primarily on filter()
, summarize()
, group_by()
, and mutate()
. For this section, we’ll be working with the babynames
data, which comes in its own package
# install.packages("babynames")
library(babynames)
head(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
tail(babynames)
## # A tibble: 6 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 2017 M Zyhier 5 0.00000255
## 2 2017 M Zykai 5 0.00000255
## 3 2017 M Zykeem 5 0.00000255
## 4 2017 M Zylin 5 0.00000255
## 5 2017 M Zylis 5 0.00000255
## 6 2017 M Zyrie 5 0.00000255
This dataset includes baby name data from the US Social Security Administration, for all names which had at least 5 uses in a particular year.
Recall that we can use filter()
to exctract just the rows we’re interested in, as in
filter(babynames, name == "Amelia")
## # A tibble: 169 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Amelia 221 0.00226
## 2 1881 F Amelia 235 0.00238
## 3 1882 F Amelia 252 0.00218
## 4 1883 F Amelia 262 0.00218
## 5 1884 F Amelia 315 0.00229
## 6 1885 F Amelia 298 0.00210
## 7 1886 F Amelia 326 0.00212
## 8 1887 F Amelia 344 0.00221
## 9 1888 F Amelia 358 0.00189
## 10 1889 F Amelia 346 0.00183
## # … with 159 more rows
Other than ==
, you can use other logical tests, such as <
, >
, <=
, >=
, !=
(not equal) and %in%
(group membership).
See if you can use the logical operators to find
prop
is greater than or equal to 0.08You can combine several criteria together using Boolean operators, such as &
(and) and |
(or). Commas also work in filter()
for a list of criteria that must all be met.
filter(babynames, name=="Amelia", year==1880) #this is equivalent to
## # A tibble: 1 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Amelia 221 0.00226
filter(babynames, name=="Amelia" & year == 1880)
## # A tibble: 1 x 5
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Amelia 221 0.00226
Try to find
Summarise is a many-to-one or many-to-few operation. It reduces the size of your data by summarizing it.
babynames %>%
dplyr::summarize(total = sum(n), max = max(n))
## # A tibble: 1 x 2
## total max
## <int> <int>
## 1 348120517 99686
Use summarize to compute:
Now, put together your filter()
and summarize()
skills to extract the rows where name == “Khaleesi” and find
We’ve seen group_by()
in an incidental way many times this semester. group_by()
is a dplyr
verb that creates a grouped data frame. It can be combined with other dplyr
verbs very powerfully.
babynames %>%
group_by(sex)
## # A tibble: 1,924,665 x 5
## # Groups: sex [2]
## year sex name n prop
## <dbl> <chr> <chr> <int> <dbl>
## 1 1880 F Mary 7065 0.0724
## 2 1880 F Anna 2604 0.0267
## 3 1880 F Emma 2003 0.0205
## 4 1880 F Elizabeth 1939 0.0199
## 5 1880 F Minnie 1746 0.0179
## 6 1880 F Margaret 1578 0.0162
## 7 1880 F Ida 1472 0.0151
## 8 1880 F Alice 1414 0.0145
## 9 1880 F Bertha 1320 0.0135
## 10 1880 F Sarah 1288 0.0132
## # … with 1,924,655 more rows
babynames %>%
group_by(sex) %>%
dplyr::summarize(total = sum(n))
## # A tibble: 2 x 2
## sex total
## <chr> <int>
## 1 F 172371079
## 2 M 175749438
Mutate is the way to make a new variable in a dataset. It is a many-to-many operation.
babynames %>%
mutate(percent = round(prop*100, 2))
## # A tibble: 1,924,665 x 6
## year sex name n prop percent
## <dbl> <chr> <chr> <int> <dbl> <dbl>
## 1 1880 F Mary 7065 0.0724 7.24
## 2 1880 F Anna 2604 0.0267 2.67
## 3 1880 F Emma 2003 0.0205 2.05
## 4 1880 F Elizabeth 1939 0.0199 1.99
## 5 1880 F Minnie 1746 0.0179 1.79
## 6 1880 F Margaret 1578 0.0162 1.62
## 7 1880 F Ida 1472 0.0151 1.51
## 8 1880 F Alice 1414 0.0145 1.45
## 9 1880 F Bertha 1320 0.0135 1.35
## 10 1880 F Sarah 1288 0.0132 1.32
## # … with 1,924,655 more rows
Of course, you would need to use the assignment operator or another pipe to do something with this new variable.
If you have more than one dataset, you may want to do a “join” to connect them together in some way. This is where multiple-table verbs come in. In dplyr
, they are left_join()
, right_join()
, full_join()
, inner_join()
, semi_join()
and anti_join()
. We will focus primarily on the left_join()
because I find it to be useful the most often in data wrangling.
For this section, we’ll use a baby example about bands to illustrate how these joins differ.
band <- tribble(
~name, ~band,
"Mick", "Stones",
"John", "Beatles",
"Paul", "Beatles"
)
instrument <- tribble(
~name, ~plays,
"John", "guitar",
"Paul", "bass",
"Keith", "guitar"
)
instrument2 <- tribble(
~artist, ~plays,
"John", "guitar",
"Paul", "bass",
"Keith", "guitar"
)
band %>%
left_join(instrument, by = "name")
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
band %>%
right_join(instrument, by = "name")
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith <NA> guitar
band %>%
full_join(instrument, by = "name")
## # A tibble: 4 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith <NA> guitar
band %>%
inner_join(instrument, by = "name")
## # A tibble: 2 x 3
## name band plays
## <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
band %>%
left_join(instrument2, by = c("name" = "artist"))
## # A tibble: 3 x 3
## name band plays
## <chr> <chr> <chr>
## 1 Mick Stones <NA>
## 2 John Beatles guitar
## 3 Paul Beatles bass
The joins we’ve looked at so far are mutating joins, but there are also filtering joins, which use information from one data set to extract cases from another data set.
Semi-join
band %>%
semi_join(instrument, by = "name")
## # A tibble: 2 x 2
## name band
## <chr> <chr>
## 1 John Beatles
## 2 Paul Beatles
Anti-join
band %>%
anti_join(instrument, by = "name")
## # A tibble: 1 x 2
## name band
## <chr> <chr>
## 1 Mick Stones
For this section, we’ll use data from the nycflights13
package, which you will need to install.
# install.packages("nycflights13")
library(nycflights13)
The package has multiple datasets, with different sets of variables. The main dataset is named flights
.
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time
## <int> <int> <int> <int> <int> <dbl> <int>
## 1 2013 1 1 517 515 2 830
## 2 2013 1 1 533 529 4 850
## 3 2013 1 1 542 540 2 923
## 4 2013 1 1 544 545 -1 1004
## 5 2013 1 1 554 600 -6 812
## 6 2013 1 1 554 558 -4 740
## 7 2013 1 1 555 600 -5 913
## 8 2013 1 1 557 600 -3 709
## 9 2013 1 1 557 600 -3 838
## 10 2013 1 1 558 600 -2 753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## # arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>
But, the variable carrier
doesn’t make much sense to people outside the airline industry. So, we’d like to join on some easier-to-understand names, which are in the airlines
dataset.
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
flights %>%
drop_na(arr_delay) %>%
%>% # add some joining code here
group_by( ) %>% # add a variable here
%>% # compute the average arrival delay
arrange( ) # arrange by delays
faa
in the airports
dataset and dest
in the flights
dataset.