library(tidyverse)
library(Stat2Data)
library(skimr)
Now is a good time to clean up your Environment in R. I recommend clicking the broom icon on your Environment tab and removing all the objects from your environment. If you're worried about losing a particular dataset, you can skip down to the bottom of the lab and learn how to use write_csv() to save data out of R. But in general, there is no danger in cleaning up your Environment.
Now, quit R and when it asks if you want to save your workspace image, select Yes. This is the last time I want you to select that! In general, you want to select No because you don’t want to preserve your environment.
Sometimes, the data you get is in the wrong "shape." Recall that at the beginning of the semester we talked about "tidy" data: data where every row is an observation at the same observational level, and every column is a variable. I usually think about this as rectangular data, but people will also sometimes call tidy data "tall" and un-tidy data "wide."
We actually had a great example of this in problem 5.24, on baseball data.
data(FantasyBaseball)
head(FantasyBaseball)
## Round DJ AR BK JW TS RL DR MF
## 1 1 42 35 49 104 15 40 26 101
## 2 2 84 26 65 101 17 143 43 16
## 3 3 21 95 115 53 66 103 113 88
## 4 4 99 41 66 123 6 144 16 79
## 5 5 25 129 53 144 6 162 113 48
## 6 6 89 62 80 247 17 55 369 2
Here, each row is a round and each variable is a person. That doesn't make a ton of sense. We would prefer to think about each row as a person, with variables for the round and their time for that round. We can use the gather()/spread() functions from tidyr to fix this.
I usually don’t get this right on the first try, so I run a few variations in my Console and just keep the one that looks the way I want. To remember which function I want, I look at the documentation
?spread
?gather
The documentation for gather says it "takes multiple columns and collapses into key-value pairs. You use gather() when you notice that you have columns that are not variables." That's what we want here.
gather(FantasyBaseball, Player, Time, Round) #nope! That's even worse than before!
gather(FantasyBaseball, Time, Round) #almost right
gather(FantasyBaseball, Player, Time, -Round) #what we want
Once I’ve identified the way I want my data to look, I can overwrite the original data with my gathered version,
FantasyBaseball <- FantasyBaseball %>%
gather(Player, Time, -Round)
head(FantasyBaseball)
## Round Player Time
## 1 1 DJ 42
## 2 2 DJ 84
## 3 3 DJ 21
## 4 4 DJ 99
## 5 5 DJ 25
## 6 6 DJ 89
Now it looks the way I would like!
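(A side note: in newer versions of tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(), though the older functions still work. A sketch, assuming tidyr 1.0.0 or later and starting from the original wide form of FantasyBaseball:)
# Equivalent to gather(Player, Time, -Round), applied to the original wide data
pivot_longer(FantasyBaseball, cols = -Round, names_to = "Player", values_to = "Time")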
Another task you may want to do for your project is take a random sample of your data, either because you want to have testing and training data, or you just need a smaller dataset that works on your laptop.
R is great at taking random samples. In base R, the sample() function lets you do this.
Again, we can read the documentation
?sample
Before we do any sampling, however, we should “set a seed” for our randomness, so our work is reproducible. If you don’t do this, the next time you or one of your group members runs the same code, you will get a different result.
Let’s just see what this looks like in a simple case
sample(1:10, size = 5)
## [1] 9 7 1 5 8
sample(1:10, size = 5)
## [1] 3 7 5 4 1
Versus if we set a seed
set.seed(123)
sample(1:10, size = 5)
## [1] 3 8 4 7 6
set.seed(123)
sample(1:10, size = 5)
## [1] 3 8 4 7 6
You can put any number in the set.seed() function. I typically use 123, 42, or my birthday, but you can use whatever you like.
Now, let's apply this to taking a sample from data. Let's say that for some reason we wanted a sample from the FantasyBaseball data.
skim(FantasyBaseball)
## Skim summary statistics
## n obs: 192
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Player 0 192 192 2 2 0 8
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Round 0 192 192 12.5 6.94 1 6.75 12.5 18.25 24
## Time 0 192 192 72.52 71.64 1 21 51.5 101.25 436
## hist
## ▇▇▇▇▇▇▇▇
## ▇▃▂▁▁▁▁▁
It has 192 rows, so maybe I want to sample 50% of them, or 96 rows. There are several ways to do this. One is to use the base R sample() function, and then slice() those rows out.
set.seed(123)
rows <- sample(1:192, size=96)
mini_baseball <- FantasyBaseball %>%
slice(rows)
skim(mini_baseball)
## Skim summary statistics
## n obs: 96
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Player 0 96 96 2 2 0 8
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Round 0 96 96 12.62 6.98 1 7 12 18.25 24 ▆▆▇▆▆▆▅▇
## Time 0 96 96 75.55 72.47 1 26 50 103.5 399 ▇▃▂▁▁▁▁▁
Now, my sample has all the same variables as my original data, but it's half the size. You can apply this same strategy to take a 25%, 10%, or any other size sample you want.
Of course, the tidyverse also has some handy convenience functions that abstract away some of the sampling. Let's try sample_n() and sample_frac().
mini_baseball2 <- FantasyBaseball %>%
sample_n(size=96)
mini_baseball3 <- FantasyBaseball %>%
sample_frac(size=0.5)
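(Similarly, in dplyr 1.0.0 and later, sample_n() and sample_frac() have been superseded by slice_sample(), though both still work. A sketch of the equivalent calls, assuming a recent dplyr:)
mini_baseball2 <- FantasyBaseball %>%
  slice_sample(n = 96)      # like sample_n(size = 96)
mini_baseball3 <- FantasyBaseball %>%
  slice_sample(prop = 0.5)  # like sample_frac(size = 0.5)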
Next, let's read in a new dataset, Physicians.csv, which has the number of physicians per 100,000 population for each state. First, let's see what happens if we use base R's read.csv().
doctorsDot <- read.csv("Physicians.csv")
skim(doctorsDot)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:factor ───────────────────────────────────────────────────
## variable missing complete n n_unique
## Physicians.per.100.000.Population 0 53 53 49
## State 0 53 53 53
## top_counts ordered
## 361: 2, 370: 2, 400: 2, 478: 2 FALSE
## Ala: 1, Ala: 1, All: 1, Ari: 1 FALSE
This example illustrates one issue with using read.csv(): it wants to make everything into factors, which we don't want! I'll keep that data around to show something later on.
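(Whether you see this behavior depends on your R version: before R 4.0.0, read.csv() defaulted to stringsAsFactors = TRUE. If you want to stick with read.csv(), you can ask for character columns explicitly; a sketch, using the illustrative name doctorsChr so we keep doctorsDot as-is for later:)
doctorsChr <- read.csv("Physicians.csv", stringsAsFactors = FALSE)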
doctors <- read_csv("Physicians.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## `Physicians per 100,000 Population` = col_character()
## )
skim(doctors)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty
## Physicians per 100,000 Population 0 53 53 3 4 0
## State 0 53 53 4 14 0
## n_unique
## 49
## 53
This is better, because read_csv() brings the variables in as character vectors, which we can work with more easily.
The variable names in this dataset are going to be hard to use,
names(doctors)
## [1] "State" "Physicians per 100,000 Population"
In order to run something like skim() on that variable, we'd have to use backticks to write out the variable name.
doctors %>%
skim(`Physicians per 100,000 Population`)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty
## Physicians per 100,000 Population 0 53 53 3 4 0
## n_unique
## 49
We can manually rename that variable,
doctors %>%
rename(Physicians = `Physicians per 100,000 Population`)
## # A tibble: 53 x 2
## State Physicians
## <chr> <chr>
## 1 Alaska 514
## 2 Alabama 331
## 3 Arkansas 321
## 4 Arizona 370
## 5 California 370
## 6 Colorado 371
## 7 Connecticut 464
## 8 D.C. 1612
## 9 Delaware 563
## 10 Florida 357
## # … with 43 more rows
That’s better!
Or, we can use a package to automatically fix all the names in a dataset.
# install.packages("janitor")
library(janitor)
doctors %>%
clean_names()
## # A tibble: 53 x 2
## state physicians_per_100_000_population
## <chr> <chr>
## 1 Alaska 514
## 2 Alabama 331
## 3 Arkansas 321
## 4 Arizona 370
## 5 California 370
## 6 Colorado 371
## 7 Connecticut 464
## 8 D.C. 1612
## 9 Delaware 563
## 10 Florida 357
## # … with 43 more rows
With either of these approaches, you would need to overwrite the original data with the renamed data in order to use it. For the sake of having a short name, let's use the rename() function.
doctors <- doctors %>%
rename(Physicians = `Physicians per 100,000 Population`)
But, we've still got a variable type issue. When we run summary statistics, R complains because there is something weird in the data. Let's try counting up the values in the data with count().
doctors %>%
count(Physicians) %>%
arrange(desc(Physicians))
## # A tibble: 49 x 2
## Physicians n
## <chr> <int>
## 1 N/A 1
## 2 644 1
## 3 575 1
## 4 563 1
## 5 514 1
## 6 510 1
## 7 506 1
## 8 504 1
## 9 485 1
## 10 478 2
## # … with 39 more rows
Now we see it! There's a strange "N/A" value that R doesn't know how to deal with. We can actually handle this in our data-reading-in code, by telling read_csv() which values to treat as missing.
doctors <- read_csv("Physicians.csv", na = c("", "NA", "N/A"))
## Parsed with column specification:
## cols(
## State = col_character(),
## `Physicians per 100,000 Population` = col_integer()
## )
Of course, we have to redo our renaming.
doctors <- doctors %>%
rename(Physicians = `Physicians per 100,000 Population`)
doctors %>%
skim(Physicians)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Physicians 1 52 53 429.06 184.93 269 346 400 464.25 1612
## hist
## ▇▃▁▁▁▁▁▁
Much better!
doctors %>%
summarise(mean(Physicians, na.rm=TRUE))
## # A tibble: 1 x 1
## `mean(Physicians, na.rm = TRUE)`
## <dbl>
## 1 429.
That makes sense: on average, states have about 430 physicians per 100,000 people. But what if we had used our data from read.csv()?
doctorsDot %>%
summarize(mean(Physicians.per.100.000.Population, na.rm=TRUE))
## Warning in mean.default(x, ..., na.rm = na.rm): argument is not numeric or
## logical: returning NA
## mean(Physicians.per.100.000.Population, na.rm = TRUE)
## 1 NA
It doesn't work because the variable is a factor, so we might want to convert it to a numeric vector. The most intuitive thing (which does not do what you want!) would be
doctorsDot %>%
summarize(mean(as.numeric(Physicians.per.100.000.Population), na.rm=TRUE))
## mean(as.numeric(Physicians.per.100.000.Population), na.rm = TRUE)
## 1 25.07547
Now that number doesn't make sense, but it's also not so weird that it would really catch your attention. It turns out that when you run as.numeric() on a factor variable, R turns each "label" (like 514, the value for Alaska) into a number based on where that label falls in the order of the factor levels (so, 514 turns into 45).
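You can see this behavior with a tiny made-up example (the values here are invented just to illustrate the coercion):
x <- factor(c("514", "331", "321"))
as.numeric(x)  # 3 2 1: the positions in the (alphabetically sorted) levels, not the values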
It turns out you can get the appropriate results by nesting as.numeric() and as.character(),
doctorsDot %>%
summarize(mean(as.numeric(as.character(Physicians.per.100.000.Population)), na.rm=TRUE))
## Warning in mean(as.numeric(as.character(Physicians.per.
## 100.000.Population)), : NAs introduced by coercion
## Warning in lazyeval::is_formula(x): NAs introduced by coercion
## mean(...)
## 1 429.0577
but it's much easier to just start with data from read_csv().
We might not want to keep all the data in our dataset. By looking at
doctors %>%
pull(State)
## [1] "Alaska" "Alabama" "Arkansas" "Arizona"
## [5] "California" "Colorado" "Connecticut" "D.C."
## [9] "Delaware" "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois" "Indiana"
## [17] "Kansas" "Kentucky" "Louisiana" "Massachusetts"
## [21] "Maryland" "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana" "North Carolina"
## [29] "North Dakota" "Nebraska" "New Hampshire" "New Jersey"
## [33] "New Mexico" "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania" "Puerto Rico"
## [41] "Rhode Island" "South Carolina" "South Dakota" "Tennessee"
## [45] "Texas" "Utah" "Virginia" "Vermont"
## [49] "Washington" "Wisconsin" "West Virginia" "Wyoming"
## [53] "All US"
I can see that All US is included, as is Puerto Rico. Let's remove those.
doctors <- doctors %>%
filter(State != "All US", State != "Puerto Rico")
doctors %>%
pull(State)
## [1] "Alaska" "Alabama" "Arkansas" "Arizona"
## [5] "California" "Colorado" "Connecticut" "D.C."
## [9] "Delaware" "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois" "Indiana"
## [17] "Kansas" "Kentucky" "Louisiana" "Massachusetts"
## [21] "Maryland" "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana" "North Carolina"
## [29] "North Dakota" "Nebraska" "New Hampshire" "New Jersey"
## [33] "New Mexico" "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
## [41] "South Carolina" "South Dakota" "Tennessee" "Texas"
## [45] "Utah" "Virginia" "Vermont" "Washington"
## [49] "Wisconsin" "West Virginia" "Wyoming"
That did it.
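(An equivalent way to write that filter, if you prefer, is to negate %in% with a vector of the values you want to drop; a sketch:)
doctors %>%
  filter(!State %in% c("All US", "Puerto Rico"))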
Let's load in some more data. This data is about police killings in different states.
policekillings <- read_csv("policekillings.csv")
## Parsed with column specification:
## cols(
## state = col_character(),
## deaths = col_integer()
## )
skim(policekillings)
## Skim summary statistics
## n obs: 47
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## state 0 47 47 2 2 0 47
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## deaths 0 47 47 9.94 12.58 1 4 7 10.5 74 ▇▂▁▁▁▁▁▁
It doesn't have many data issues. But, say we wanted to merge it with our physicians data. We need a variable to "join on." That is, something that R can use to match up which row in doctors goes with which row in policekillings. Unfortunately, doctors has state names written out, while policekillings just has the state abbreviations. We need to change one to match the other.
policekillings <- policekillings %>%
mutate(state = state.name[match(state, state.abb)])
skim(policekillings)
## Skim summary statistics
## n obs: 47
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## state 1 46 47 4 14 0 46
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## deaths 0 47 47 9.94 12.58 1 4 7 10.5 74 ▇▂▁▁▁▁▁▁
This works because state.abb is a vector of all the state abbreviations and state.name is a vector of all the full state names, in the same order. Both come built in to R.
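To see why the match() trick works, here is a small illustration with a single abbreviation:
match("AK", state.abb)               # 2: the position of "AK" in state.abb
state.name[match("AK", state.abb)]   # "Alaska": the full name at that same position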
Now, we can do some joining. After all, if we want to understand the relationship between physicians and police killings, we have to have both variables in the same data frame. The command that allows you to combine data frames is a "join," which corresponds to the database operation called a JOIN.
When two database tables (or data frames) are joined, the rows of one table get matched to the rows of the other. Computers are far better suited to this task than humans, but the computer needs to be told what the criteria are for matching up the rows. The variable(s) the matching is based on are called a key. Both tables must contain the key columns, but there are a variety of ways in which the matching can be done.
Joins can be complicated, so it may help to look at a visual explanation of the different join types.
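As a tiny concrete example (made-up data, not part of our lab datasets), here is a join where name is the key:
band <- tibble(name = c("Mick", "John", "Paul"),
               band = c("Stones", "Beatles", "Beatles"))
instruments <- tibble(name = c("John", "Paul", "Keith"),
                      plays = c("guitar", "bass", "guitar"))
left_join(band, instruments, by = "name")
# Mick has no match in instruments, so his plays value comes back as NA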
Before we start, let’s consider the dimensions of our two data frames.
dim(policekillings)
## [1] 47 2
dim(doctors)
## [1] 51 2
They don't have the same number of rows, so we should think about how large we want our data to be at the end of our join. Do we want to keep all 51 rows in doctors and match elements from policekillings, filling with NA values where there aren't matches? Or, are we more interested in policekillings, so we just want to match the 47 states with elements from doctors?
In databases, the default JOIN type is an INNER JOIN. In this case, only the rows that are present in both data frames get returned. In R, I find I most often use a left_join, which retains all of the records from the first data frame, regardless of whether there is a match in the second data frame.
There are also right joins, which retain all of the records from the second data frame, regardless of whether there is a match in the first data frame, and full joins, which contain all of the records from both data frames, regardless of whether there was a match. [Note that the number of rows returned can exceed the number of rows in either of the two original data frames, if a key value matches multiple rows.]
For this application, let's try a left_join.
joinedData <- left_join(policekillings, doctors)
## Error: `by` required, because the data sources have no common variables
This throws an error, because it is expecting there to be a variable with the same name in each data set. To get it to work, we need to specify which variables to join on.
joinedData <- left_join(policekillings, doctors, by = c("state"= "State"))
skim(joinedData)
## Skim summary statistics
## n obs: 47
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## state 1 46 47 4 14 0 46
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## deaths 0 47 47 9.94 12.58 1 4 7 10.5 74
## Physicians 1 46 47 404 79.1 269 342 394 460 644
## hist
## ▇▂▁▁▁▁▁▁
## ▂▇▅▃▅▁▁▁
Notice that this dataset only has 47 observations, because there were only 47 in policekillings (the "left" in our left_join). We could have reversed the order of the datasets to get something slightly different.
joinedData <- left_join(doctors, policekillings, by = c("State"= "state"))
skim(joinedData)
## Skim summary statistics
## n obs: 51
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## State 0 51 51 4 14 0 51
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## deaths 5 46 51 10.13 12.64 1 4 7 10.75 74
## Physicians 0 51 51 431.84 185.66 269 350 400 464.5 1612
## hist
## ▇▂▁▁▁▁▁▁
## ▇▃▁▁▁▁▁▁
Now, the data has 51 observations! (We could have gotten the same result by running right_join with the data frames specified in the same order as our first join.)
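A sketch of that equivalent right_join call (assigned to a new name, joinedData2, so we don't overwrite anything; the result carries the same information, though the key column keeps the name state from the first data frame):
joinedData2 <- right_join(policekillings, doctors, by = c("state" = "State"))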
For another example, consider this data about political opinions.
politics <- read_csv("politics.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## Left_Right_Wing = col_integer(),
## BirthYear = col_integer(),
## Income = col_integer(),
## Religiousity = col_integer(),
## Education = col_integer(),
## Country = col_character(),
## Location = col_integer(),
## Gender = col_integer(),
## Occupation = col_integer(),
## Marital_Status = col_integer(),
## Age = col_integer()
## )
skim(politics)
## Skim summary statistics
## n obs: 3564
## n variables: 12
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Country 0 3564 3564 6 14 0 4
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## Age 0 3564 3564 48.23 14.18 20 37
## BirthYear 0 3564 3564 1967.77 14.18 1932 1958
## Education 0 3564 3564 4.99 1.74 2 4
## Gender 0 3564 3564 1.45 0.5 1 1
## Income 0 3564 3564 3.29 1.31 1 2
## Left_Right_Wing 0 3564 3564 4.74 2.2 0 3
## Location 0 3564 3564 2.25 1.11 1 1
## Marital_Status 0 3564 3564 1.99 1.34 1 1
## Occupation 0 3564 3564 433.64 225.17 10 241
## Religiousity 0 3564 3564 2.42 0.87 1 2
## X1 0 3564 3564 40399.72 42241.45 3958 15224.75
## p50 p75 p100 hist
## 48 58 84 ▃▆▆▇▇▃▂▁
## 1968 1979 1996 ▁▃▅▇▇▆▆▂
## 4 6 9 ▁▂▇▂▁▁▂▁
## 1 2 2 ▇▁▁▁▁▁▁▆
## 3 4 5 ▃▆▁▇▁▇▁▇
## 5 6 10 ▂▁▂▇▂▂▁▁
## 2 3 4 ▇▁▆▁▁▅▁▅
## 1 4 4 ▇▁▁▁▁▁▁▃
## 410 532 960 ▂▇▇▃▆▂▅▂
## 3 3 4 ▃▁▆▁▁▇▁▂
## 43113.5 48222.25 470810 ▇▁▁▁▁▁▁▁
We might want to filter the data to only include the observations from Bulgaria, so we can use filter() to keep only those rows.
politics <- politics %>%
filter(Country == "Bulgaria")
This doesn't work, and the error message is a little obscure. I had to google it! Looking at those results, I saw one that mentioned filter(), so I checked it out. It seems like the problem is empty variable names, which we have in this dataset (recall the warning about the missing column name when we read the data in).
names(politics)
## [1] "X1" "Left_Right_Wing" "BirthYear"
## [4] "Income" "Religiousity" "Education"
## [7] "Country" "Location" "Gender"
## [10] "Occupation" "Marital_Status" "Age"
To fix this, we can rename that oddly-named first column. Now, our filter works!
politics <- politics %>%
  rename(obs = X1) # X1 is the placeholder name read_csv() gave the unnamed first column
skim(politics)
politics <- politics %>%
filter(Country == "Bulgaria")
We also might want to recode Left_Right_Wing to be a binary variable.
politics <- politics %>%
mutate(LRbinary = if_else(Left_Right_Wing>5, "right", "left"))
skim(politics)
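A quick sanity check on the new variable (a sketch; the counts depend on which rows you kept):
politics %>%
  count(LRbinary)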
Now that we've spent all this time getting our data into the right format, we want to write it to a file so that we never have to do this again! The opposite of read_csv() is write_csv(), and it does exactly that.
write_csv(policekillings, "policekillings.csv")
The file should then show up in your “Files” window, where you can “Export” it to your local computer if you like.
For your project, I’d like to see all the data-cleaning code you used in order to arrive at your final data, but it can get annoying running that code over and over again. So, saving your clean data can allow you to quickly read it in and start working. It’s also a good way to share data with your teammates.
Now, close your RStudio session, and this time choose Don't Save when it asks about saving your workspace image!