Data wrangling, part II

Today’s wrangling exercises are modified from Master the Tidyverse.

Single-table verbs

The dplyr package is where the most common data wrangling “verbs” come from. The main ones are:

select() to pick out variables/columns
filter() to pick out cases/rows
arrange(), which arranges cases
summarize(), a many-to-few or many-to-one operation to suummarize
mutate(), a many-to-many operation to add additional variables.
group_by(), a helper to create grouped data frames

We’ll focus primarily on filter(), summarize(), group_by(), and mutate(). For this section, we’ll be working with the babynames data, which comes in its own package

# install.packages("babynames")
library(babynames)
head(babynames)

## # A tibble: 6 x 5
##    year sex   name          n   prop
##   <dbl> <chr> <chr>     <int>  <dbl>
## 1  1880 F     Mary       7065 0.0724
## 2  1880 F     Anna       2604 0.0267
## 3  1880 F     Emma       2003 0.0205
## 4  1880 F     Elizabeth  1939 0.0199
## 5  1880 F     Minnie     1746 0.0179
## 6  1880 F     Margaret   1578 0.0162

tail(babynames)

## # A tibble: 6 x 5
##    year sex   name       n       prop
##   <dbl> <chr> <chr>  <int>      <dbl>
## 1  2017 M     Zyhier     5 0.00000255
## 2  2017 M     Zykai      5 0.00000255
## 3  2017 M     Zykeem     5 0.00000255
## 4  2017 M     Zylin      5 0.00000255
## 5  2017 M     Zylis      5 0.00000255
## 6  2017 M     Zyrie      5 0.00000255

This dataset includes baby name data from the US Social Security Administration, for all names which had at least 5 uses in a particular year.

Filter

Recall that we can use filter() to exctract just the rows we’re interested in, as in

filter(babynames, name == "Amelia")

## # A tibble: 169 x 5
##     year sex   name       n    prop
##    <dbl> <chr> <chr>  <int>   <dbl>
##  1  1880 F     Amelia   221 0.00226
##  2  1881 F     Amelia   235 0.00238
##  3  1882 F     Amelia   252 0.00218
##  4  1883 F     Amelia   262 0.00218
##  5  1884 F     Amelia   315 0.00229
##  6  1885 F     Amelia   298 0.00210
##  7  1886 F     Amelia   326 0.00212
##  8  1887 F     Amelia   344 0.00221
##  9  1888 F     Amelia   358 0.00189
## 10  1889 F     Amelia   346 0.00183
## # … with 159 more rows

Other than ==, you can use other logical tests, such as <, >, <=, >=, != (not equal) and %in% (group membership).

See if you can use the logical operators to find

all of the names where prop is greater than or equal to 0.08
all of the children named “Sea”

You can combine several criteria together using Boolean operators, such as & (and) and | (or). Commas also work in filter() for a list of criteria that must all be met.

filter(babynames, name=="Amelia", year==1880) #this is equivalent to

## # A tibble: 1 x 5
##    year sex   name       n    prop
##   <dbl> <chr> <chr>  <int>   <dbl>
## 1  1880 F     Amelia   221 0.00226

filter(babynames, name=="Amelia" & year == 1880)

## # A tibble: 1 x 5
##    year sex   name       n    prop
##   <dbl> <chr> <chr>  <int>   <dbl>
## 1  1880 F     Amelia   221 0.00226

Try to find

All girls named “Sea”
Names that were used by exactly 5 or 6 children in 1880
Names that are one of Acura, Lexus, or Yugo
The row that corresponds to you (or, a good friend, if you have an uncommon name)

Summarize

Summarise is a many-to-one or many-to-few operation. It reduces the size of your data by summarizing it.

babynames %>% 
  dplyr::summarize(total = sum(n), max = max(n))

## # A tibble: 1 x 2
##       total   max
##       <int> <int>
## 1 348120517 99686

Use summarize to compute:

The first (minimum) year in the dataset
The last (maximum) year in the dataset

Now, put together your filter() and summarize() skills to extract the rows where name == “Khaleesi” and find

The total number of children named Khaleesi
The first year Khaleesi appeared in the data

Group by

We’ve seen group_by() in an incidental way many times this semester. group_by() is a dplyr verb that creates a grouped data frame. It can be combined with other dplyr verbs very powerfully.

babynames %>%
  group_by(sex)

## # A tibble: 1,924,665 x 5
## # Groups:   sex [2]
##     year sex   name          n   prop
##    <dbl> <chr> <chr>     <int>  <dbl>
##  1  1880 F     Mary       7065 0.0724
##  2  1880 F     Anna       2604 0.0267
##  3  1880 F     Emma       2003 0.0205
##  4  1880 F     Elizabeth  1939 0.0199
##  5  1880 F     Minnie     1746 0.0179
##  6  1880 F     Margaret   1578 0.0162
##  7  1880 F     Ida        1472 0.0151
##  8  1880 F     Alice      1414 0.0145
##  9  1880 F     Bertha     1320 0.0135
## 10  1880 F     Sarah      1288 0.0132
## # … with 1,924,655 more rows

babynames %>%
  group_by(sex) %>%
  dplyr::summarize(total = sum(n))

## # A tibble: 2 x 2
##   sex       total
##   <chr>     <int>
## 1 F     172371079
## 2 M     175749438

Mutate

Mutate is the way to make a new variable in a dataset. It is a many-to-many operation.

babynames %>%
  mutate(percent = round(prop*100, 2))

## # A tibble: 1,924,665 x 6
##     year sex   name          n   prop percent
##    <dbl> <chr> <chr>     <int>  <dbl>   <dbl>
##  1  1880 F     Mary       7065 0.0724    7.24
##  2  1880 F     Anna       2604 0.0267    2.67
##  3  1880 F     Emma       2003 0.0205    2.05
##  4  1880 F     Elizabeth  1939 0.0199    1.99
##  5  1880 F     Minnie     1746 0.0179    1.79
##  6  1880 F     Margaret   1578 0.0162    1.62
##  7  1880 F     Ida        1472 0.0151    1.51
##  8  1880 F     Alice      1414 0.0145    1.45
##  9  1880 F     Bertha     1320 0.0135    1.35
## 10  1880 F     Sarah      1288 0.0132    1.32
## # … with 1,924,655 more rows

Of course, you would need to use the assignment operator or another pipe to do something with this new variable.

Multiple table verbs

If you have more than one dataset, you may want to do a “join” to connect them together in some way. This is where multiple-table verbs come in. In dplyr, they are left_join(), right_join(), full_join(), inner_join(), semi_join() and anti_join(). We will focus primarily on the left_join() because I find it to be useful the most often in data wrangling.

For this section, we’ll use a baby example about bands to illustrate how these joins differ.

band <- tribble(
  ~name,  ~band,
  "Mick", "Stones",
  "John", "Beatles",
  "Paul", "Beatles"
)
instrument <- tribble(
  ~name, ~plays,
  "John", "guitar",
  "Paul", "bass",
  "Keith", "guitar"
)

instrument2 <- tribble(
  ~artist, ~plays,
  "John", "guitar",
  "Paul", "bass",
  "Keith", "guitar"
)

Left join

band %>%
  left_join(instrument, by = "name")

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Right join

band %>%
  right_join(instrument, by = "name")

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass  
## 3 Keith <NA>    guitar

How do the results of the left and right joins differ?

Full join

band %>%
  full_join(instrument, by = "name")

## # A tibble: 4 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass  
## 4 Keith <NA>    guitar

Inner join

band %>%
  inner_join(instrument, by = "name")

## # A tibble: 2 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 John  Beatles guitar
## 2 Paul  Beatles bass

What if the names don’t match?

band %>% 
  left_join(instrument2, by = c("name" = "artist"))

## # A tibble: 3 x 3
##   name  band    plays 
##   <chr> <chr>   <chr> 
## 1 Mick  Stones  <NA>  
## 2 John  Beatles guitar
## 3 Paul  Beatles bass

Filtering joins

The joins we’ve looked at so far are mutating joins, but there are also filtering joins, which use information from one data set to extract cases from another data set.

Semi-join

band %>% 
  semi_join(instrument, by = "name")

## # A tibble: 2 x 2
##   name  band   
##   <chr> <chr>  
## 1 John  Beatles
## 2 Paul  Beatles

Anti-join

band %>% 
  anti_join(instrument, by = "name")

## # A tibble: 1 x 2
##   name  band  
##   <chr> <chr> 
## 1 Mick  Stones

Practice

For this section, we’ll use data from the nycflights13 package, which you will need to install.

# install.packages("nycflights13")
library(nycflights13)

The package has multiple datasets, with different sets of variables. The main dataset is named flights.

flights

## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # … with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>

But, the variable carrier doesn’t make much sense to people outside the airline industry. So, we’d like to join on some easier-to-understand names, which are in the airlines dataset.

airlines

## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

Which airlines had the largest arrival delays? Work in groups to complete the code below.

flights %>%
  drop_na(arr_delay) %>%
                     %>% # add some joining code here
  group_by(        ) %>% # add a variable here
                     %>% # compute the average arrival delay
  arrange(         ) # arrange by delays

How many airports in airports are serviced by flights originating in New York (i.e. flights in our dataset?) Notice that the column to join on is named faa in the airports dataset and dest in the flights dataset.