library(tidyverse)
library(Stat2Data)
library(skimr)
Now is a good time to clean up your Environment in R. I recommend clicking the broom icon on your Environment tab and removing all the objects from your environment. If you're worried about losing a particular dataset, you can skip down to the bottom of the lab and learn how to use write_csv() to save data out of R. But in general, there is no danger in cleaning up your Environment.
Now, quit R and when it asks if you want to save your workspace image, select Yes. This is the last time I want you to select that! In general, you want to select No because you don’t want to preserve your environment.
Sometimes, the data you get is in the wrong "shape." Recall that at the beginning of the semester we talked about "tidy" data: data where every row is an observation at the same observational level, and every column is a variable. I usually think about this as rectangular data, but people will also sometimes call tidy data "tall" and un-tidy data "wide."
We actually had a great example of this in problem 5.24, on baseball data.
data(FantasyBaseball)
head(FantasyBaseball)
## Round DJ AR BK JW TS RL DR MF
## 1 1 42 35 49 104 15 40 26 101
## 2 2 84 26 65 101 17 143 43 16
## 3 3 21 95 115 53 66 103 113 88
## 4 4 99 41 66 123 6 144 16 79
## 5 5 25 129 53 144 6 162 113 48
## 6 6 89 62 80 247 17 55 369 2
Here, each row is a round and each variable is a person. That doesn't make a ton of sense. We would prefer to think about each row as a person, with variables for the round and their time for that round. We can use the gather()/spread() functions from tidyr to fix this.
I usually don’t get this right on the first try, so I run a few variations in my Console and just keep the one that looks the way I want. To remember which function I want, I look at the documentation
?spread
?gather
The documentation for gather says it "takes multiple columns and collapses into key-value pairs. You use gather() when you notice that you have columns that are not variables." That's what we want here.
gather(FantasyBaseball, Player, Time, Round) #nope! That's even worse than before!
gather(FantasyBaseball, Time, Round) #almost right
gather(FantasyBaseball, Player, Time, -Round) #what we want
Once I’ve identified the way I want my data to look, I can overwrite the original data with my gathered version,
FantasyBaseball <- FantasyBaseball %>%
gather(Player, Time, -Round)
head(FantasyBaseball)
## Round Player Time
## 1 1 DJ 42
## 2 2 DJ 84
## 3 3 DJ 21
## 4 4 DJ 99
## 5 5 DJ 25
## 6 6 DJ 89
Now it looks the way I would like!
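(A side note: in newer versions of tidyr, gather() and spread() have been superseded by pivot_longer() and pivot_wider(), though the older functions still work. A sketch, assuming tidyr 1.0.0 or later and starting from the original wide form of FantasyBaseball:)
# Equivalent to gather(Player, Time, -Round), applied to the original wide data
pivot_longer(FantasyBaseball, cols = -Round, names_to = "Player", values_to = "Time")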
Another task you may want to do for your project is take a random sample of your data, either because you want to have testing and training data, or you just need a smaller dataset that works on your laptop.
R is great at taking random samples. In base R, the sample() function lets you do this.
Again, we can read the documentation
?sample
Before we do any sampling, however, we should “set a seed” for our randomness, so our work is reproducible. If you don’t do this, the next time you or one of your group members runs the same code, you will get a different result.
Let’s just see what this looks like in a simple case
sample(1:10, size = 5)
## [1] 9 7 1 5 8
sample(1:10, size = 5)
## [1] 3 7 5 4 1
Versus if we set a seed
set.seed(123)
sample(1:10, size = 5)
## [1] 3 8 4 7 6
set.seed(123)
sample(1:10, size = 5)
## [1] 3 8 4 7 6
You can put any number in the set.seed() function. I typically use 123, 42, or my birthday, but you can use whatever you like.
Now, let's apply this to taking a sample from data. Let's say that for some reason we wanted a sample from the FantasyBaseball data.
skim(FantasyBaseball)
## Skim summary statistics
## n obs: 192
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Player 0 192 192 2 2 0 8
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Round 0 192 192 12.5 6.94 1 6.75 12.5 18.25 24
## Time 0 192 192 72.52 71.64 1 21 51.5 101.25 436
## hist
## ▇▇▇▇▇▇▇▇
## ▇▃▂▁▁▁▁▁
It has 192 rows, so maybe I want to sample 50% of them, or 96 rows. There are several ways to do this. One is to use the base R sample() function, and then slice() those rows out.
set.seed(123)
rows <- sample(1:192, size=96)
mini_baseball <- FantasyBaseball %>%
slice(rows)
skim(mini_baseball)
## Skim summary statistics
## n obs: 96
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Player 0 96 96 2 2 0 8
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## Round 0 96 96 12.62 6.98 1 7 12 18.25 24 ▆▆▇▆▆▆▅▇
## Time 0 96 96 75.55 72.47 1 26 50 103.5 399 ▇▃▂▁▁▁▁▁
Now, my sample has all the same variables as my original data, but it's half the size. You can apply this same strategy to take a 25%, 10%, or any other size sample you want.
Of course, the tidyverse also has some handy convenience functions that abstract away some of the sampling. Let's try sample_n() and sample_frac().
mini_baseball2 <- FantasyBaseball %>%
sample_n(size=96)
mini_baseball3 <- FantasyBaseball %>%
sample_frac(size=0.5)
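(Similarly, in dplyr 1.0.0 and later, sample_n() and sample_frac() have been superseded by slice_sample(), though both still work. A sketch of the equivalent calls, assuming a recent dplyr:)
mini_baseball2 <- FantasyBaseball %>%
  slice_sample(n = 96)      # like sample_n(size = 96)
mini_baseball3 <- FantasyBaseball %>%
  slice_sample(prop = 0.5)  # like sample_frac(size = 0.5)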
Next, let's read in a new dataset, Physicians.csv, which has the number of physicians per 100,000 population for each state. First, let's see what happens if we use base R's read.csv().
doctorsDot <- read.csv("Physicians.csv")
skim(doctorsDot)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:factor ───────────────────────────────────────────────────
## variable missing complete n n_unique
## Physicians.per.100.000.Population 0 53 53 49
## State 0 53 53 53
## top_counts ordered
## 361: 2, 370: 2, 400: 2, 478: 2 FALSE
## Ala: 1, Ala: 1, All: 1, Ari: 1 FALSE
This example illustrates one issue with using read.csv(): it wants to make everything into factors, which we don't want! I'll keep that data around to show something later on.
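(Whether you see this behavior depends on your R version: before R 4.0.0, read.csv() defaulted to stringsAsFactors = TRUE. If you want to stick with read.csv(), you can ask for character columns explicitly; a sketch, using the illustrative name doctorsChr so we keep doctorsDot as-is for later:)
doctorsChr <- read.csv("Physicians.csv", stringsAsFactors = FALSE)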
doctors <- read_csv("Physicians.csv")
## Parsed with column specification:
## cols(
## State = col_character(),
## `Physicians per 100,000 Population` = col_character()
## )
skim(doctors)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty
## Physicians per 100,000 Population 0 53 53 3 4 0
## State 0 53 53 4 14 0
## n_unique
## 49
## 53
This is better, because read_csv() brings the variables in as character vectors, which we can work with more easily.
The variable names in this dataset are going to be hard to use,
names(doctors)
## [1] "State" "Physicians per 100,000 Population"
In order to run something like skim() on that variable, we'd have to use backticks to write out the variable name.
doctors %>%
skim(`Physicians per 100,000 Population`)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty
## Physicians per 100,000 Population 0 53 53 3 4 0
## n_unique
## 49
We can manually rename that variable,
doctors %>%
rename(Physicians = `Physicians per 100,000 Population`)
## # A tibble: 53 x 2
## State Physicians
## <chr> <chr>
## 1 Alaska 514
## 2 Alabama 331
## 3 Arkansas 321
## 4 Arizona 370
## 5 California 370
## 6 Colorado 371
## 7 Connecticut 464
## 8 D.C. 1612
## 9 Delaware 563
## 10 Florida 357
## # … with 43 more rows
That’s better!
Or, we can use a package to automatically fix all the names in a dataset.
# install.packages("janitor")
library(janitor)
doctors %>%
clean_names()
## # A tibble: 53 x 2
## state physicians_per_100_000_population
## <chr> <chr>
## 1 Alaska 514
## 2 Alabama 331
## 3 Arkansas 321
## 4 Arizona 370
## 5 California 370
## 6 Colorado 371
## 7 Connecticut 464
## 8 D.C. 1612
## 9 Delaware 563
## 10 Florida 357
## # … with 43 more rows
With either of these approaches, you would need to overwrite the original data with the renamed data in order to use it. For the sake of having a short name, let's use the rename() function.
doctors <- doctors %>%
rename(Physicians = `Physicians per 100,000 Population`)
But, we've still got a variable type issue. When we run summary statistics, R complains because there is something weird in the data. Let's try counting up the values in the data with count().
doctors %>%
count(Physicians) %>%
arrange(desc(Physicians))
## # A tibble: 49 x 2
## Physicians n
## <chr> <int>
## 1 N/A 1
## 2 644 1
## 3 575 1
## 4 563 1
## 5 514 1
## 6 510 1
## 7 506 1
## 8 504 1
## 9 485 1
## 10 478 2
## # … with 39 more rows
Now we see it! There's a strange "N/A" value that R doesn't know how to deal with. We can actually handle this in our data-reading-in code, by telling read_csv() which values to treat as missing.
doctors <- read_csv("Physicians.csv", na = c("", "NA", "N/A"))
## Parsed with column specification:
## cols(
## State = col_character(),
## `Physicians per 100,000 Population` = col_integer()
## )
Of course, we have to redo our renaming.
doctors <- doctors %>%
rename(Physicians = `Physicians per 100,000 Population`)
doctors %>%
skim(Physicians)
## Skim summary statistics
## n obs: 53
## n variables: 2
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## Physicians 1 52 53 429.06 184.93 269 346 400 464.25 1612
## hist
## ▇▃▁▁▁▁▁▁
Much better!
doctors %>%
summarise(mean(Physicians, na.rm=TRUE))
## # A tibble: 1 x 1
## `mean(Physicians, na.rm = TRUE)`
## <dbl>
## 1 429.
That makes sense: on average, states have about 430 physicians per 100,000 people. But what if we had used our data from read.csv()?
doctorsDot %>%
summarize(mean(Physicians.per.100.000.Population, na.rm=TRUE))
## Warning in mean.default(x, ..., na.rm = na.rm): argument is not numeric or
## logical: returning NA
## mean(Physicians.per.100.000.Population, na.rm = TRUE)
## 1 NA
It doesn't work because the variable is a factor, so we might want to convert it to a numeric vector. The most intuitive thing (which does not do what you want!) would be
doctorsDot %>%
summarize(mean(as.numeric(Physicians.per.100.000.Population), na.rm=TRUE))
## mean(as.numeric(Physicians.per.100.000.Population), na.rm = TRUE)
## 1 25.07547
Now that number doesn't make sense, but it's also not so weird that it would really catch your attention. It turns out that when you run as.numeric() on a factor variable, R turns each "label" (like 514, the value for Alaska) into a number based on where that label falls in the order of the factor levels (so, 514 turns into 45).
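You can see this behavior with a tiny made-up example (the values here are invented just to illustrate the coercion):
x <- factor(c("514", "331", "321"))
as.numeric(x)  # 3 2 1: the positions in the (alphabetically sorted) levels, not the values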
It turns out you can get the appropriate results by nesting as.numeric() and as.character(),
doctorsDot %>%
summarize(mean(as.numeric(as.character(Physicians.per.100.000.Population)), na.rm=TRUE))
## Warning in mean(as.numeric(as.character(Physicians.per.
## 100.000.Population)), : NAs introduced by coercion
## Warning in lazyeval::is_formula(x): NAs introduced by coercion
## mean(...)
## 1 429.0577
but it's much easier to just start with data from read_csv().
We might not want to keep all the data in our dataset. By looking at
doctors %>%
pull(State)
## [1] "Alaska" "Alabama" "Arkansas" "Arizona"
## [5] "California" "Colorado" "Connecticut" "D.C."
## [9] "Delaware" "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois" "Indiana"
## [17] "Kansas" "Kentucky" "Louisiana" "Massachusetts"
## [21] "Maryland" "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana" "North Carolina"
## [29] "North Dakota" "Nebraska" "New Hampshire" "New Jersey"
## [33] "New Mexico" "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania" "Puerto Rico"
## [41] "Rhode Island" "South Carolina" "South Dakota" "Tennessee"
## [45] "Texas" "Utah" "Virginia" "Vermont"
## [49] "Washington" "Wisconsin" "West Virginia" "Wyoming"
## [53] "All US"
I can see that All US is included, as is Puerto Rico. Let's remove those.
doctors <- doctors %>%
filter(State != "All US", State != "Puerto Rico")
doctors %>%
pull(State)
## [1] "Alaska" "Alabama" "Arkansas" "Arizona"
## [5] "California" "Colorado" "Connecticut" "D.C."
## [9] "Delaware" "Florida" "Georgia" "Hawaii"
## [13] "Iowa" "Idaho" "Illinois" "Indiana"
## [17] "Kansas" "Kentucky" "Louisiana" "Massachusetts"
## [21] "Maryland" "Maine" "Michigan" "Minnesota"
## [25] "Missouri" "Mississippi" "Montana" "North Carolina"
## [29] "North Dakota" "Nebraska" "New Hampshire" "New Jersey"
## [33] "New Mexico" "Nevada" "New York" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania" "Rhode Island"
## [41] "South Carolina" "South Dakota" "Tennessee" "Texas"
## [45] "Utah" "Virginia" "Vermont" "Washington"
## [49] "Wisconsin" "West Virginia" "Wyoming"
That did it.
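(An equivalent way to write that filter, if you prefer, is to negate %in% with a vector of the values you want to drop; a sketch:)
doctors %>%
  filter(!State %in% c("All US", "Puerto Rico"))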
Let's load in some more data. This data is about police killings in different states.
policekillings <- read_csv("policekillings.csv")
## Parsed with column specification:
## cols(
## state = col_character(),
## deaths = col_integer()
## )
skim(policekillings)
## Skim summary statistics
## n obs: 47
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## state 0 47 47 2 2 0 47
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## deaths 0 47 47 9.94 12.58 1 4 7 10.5 74 ▇▂▁▁▁▁▁▁
It doesn't have many data issues. But, say we wanted to merge it with our physicians data. We need a variable to "join on." That is, something that R can use to match up which row in doctors goes with which row in policekillings. Unfortunately, doctors has state names written out, while policekillings just has the state abbreviations. We need to change one to match the other.
policekillings <- policekillings %>%
mutate(state = state.name[match(state, state.abb)])
skim(policekillings)
## Skim summary statistics
## n obs: 47
## n variables: 2
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## state 1 46 47 4 14 0 46
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100 hist
## deaths 0 47 47 9.94 12.58 1 4 7 10.5 74 ▇▂▁▁▁▁▁▁
This works because state.abb is a vector of all the state abbreviations and state.name is a vector of all the full state names, in the same order. Both come built in to R.
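To see why the match() trick works, here is a small illustration with a single abbreviation:
match("AK", state.abb)               # 2: the position of "AK" in state.abb
state.name[match("AK", state.abb)]   # "Alaska": the full name at that same position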
Now, we can do some joining. After all, if we want to understand the relationship between physicians and police killings, we have to have both variables in the same data frame. The command that allows you to combine data frames is a "join," which corresponds to the database operation called a JOIN.
When two database tables (or data frames) are joined, the rows of one table get matched to the rows of the other. Computers are far better suited to this task than humans, but the computer needs to be told what the criteria are for matching up the rows. The variable(s) the matching is based on are called a key. Both tables must contain the key columns, but there are a variety of ways in which the matching can be done.
Joins can be complicated, so it may help to look at a visual explanation of the different join types.
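As a tiny concrete example (made-up data, not part of our lab datasets), here is a join where name is the key:
band <- tibble(name = c("Mick", "John", "Paul"),
               band = c("Stones", "Beatles", "Beatles"))
instruments <- tibble(name = c("John", "Paul", "Keith"),
                      plays = c("guitar", "bass", "guitar"))
left_join(band, instruments, by = "name")
# Mick has no match in instruments, so his plays value comes back as NA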
Before we start, let’s consider the dimensions of our two data frames.
dim(policekillings)
## [1] 47 2
dim(doctors)
## [1] 51 2
They don't have the same number of rows, so we should think about how large we want our data to be at the end of our join. Do we want to keep all 51 rows in doctors and match elements from policekillings, filling with NA values where there aren't matches? Or, are we more interested in policekillings, so we just want to match the 47 states with elements from doctors?
In databases, the default JOIN type is an INNER JOIN. In this case, only the rows that are present in both data frames get returned. In R, I find I most often use a left_join, which retains all of the records from the first data frame, regardless of whether there is a match in the second data frame.
There are also right joins, which retain all of the records from the second data frame, regardless of whether there is a match in the first data frame, and full joins, which contain all of the records from both data frames, regardless of whether there was a match. [Note that the number of rows returned can exceed the number of rows in either of the two original data frames, if a key value matches multiple rows.]
For this application, let's try a left_join.
joinedData <- left_join(policekillings, doctors)
## Error: `by` required, because the data sources have no common variables
This throws an error, because it is expecting there to be a variable with the same name in each data set. To get it to work, we need to specify which variables to join on.
joinedData <- left_join(policekillings, doctors, by = c("state"= "State"))
skim(joinedData)
## Skim summary statistics
## n obs: 47
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## state 1 46 47 4 14 0 46
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## deaths 0 47 47 9.94 12.58 1 4 7 10.5 74
## Physicians 1 46 47 404 79.1 269 342 394 460 644
## hist
## ▇▂▁▁▁▁▁▁
## ▂▇▅▃▅▁▁▁
Notice that this dataset only has 47 observations, because there were only 47 in policekillings (the "left" in our left_join). We could have reversed the order of the datasets to get something slightly different.
joinedData <- left_join(doctors, policekillings, by = c("State"= "state"))
skim(joinedData)
## Skim summary statistics
## n obs: 51
## n variables: 3
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## State 0 51 51 4 14 0 51
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25 p50 p75 p100
## deaths 5 46 51 10.13 12.64 1 4 7 10.75 74
## Physicians 0 51 51 431.84 185.66 269 350 400 464.5 1612
## hist
## ▇▂▁▁▁▁▁▁
## ▇▃▁▁▁▁▁▁
Now, the data has 51 observations! (We could have gotten the same result by running right_join with the data frames specified in the same order as our first join.)
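A sketch of that equivalent right_join call (assigned to a new name, joinedData2, so we don't overwrite anything; the result carries the same information, though the key column keeps the name state from the first data frame):
joinedData2 <- right_join(policekillings, doctors, by = c("state" = "State"))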
For another example, consider this data about political opinions.
politics <- read_csv("politics.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_integer(),
## Left_Right_Wing = col_integer(),
## BirthYear = col_integer(),
## Income = col_integer(),
## Religiousity = col_integer(),
## Education = col_integer(),
## Country = col_character(),
## Location = col_integer(),
## Gender = col_integer(),
## Occupation = col_integer(),
## Marital_Status = col_integer(),
## Age = col_integer()
## )
skim(politics)
## Skim summary statistics
## n obs: 3564
## n variables: 12
##
## ── Variable type:character ────────────────────────────────────────────────
## variable missing complete n min max empty n_unique
## Country 0 3564 3564 6 14 0 4
##
## ── Variable type:integer ──────────────────────────────────────────────────
## variable missing complete n mean sd p0 p25
## Age 0 3564 3564 48.23 14.18 20 37
## BirthYear 0 3564 3564 1967.77 14.18 1932 1958
## Education 0 3564 3564 4.99 1.74 2 4
## Gender 0 3564 3564 1.45 0.5 1 1
## Income 0 3564 3564 3.29 1.31 1 2
## Left_Right_Wing 0 3564 3564 4.74 2.2 0 3
## Location 0 3564 3564 2.25 1.11 1 1
## Marital_Status 0 3564 3564 1.99 1.34 1 1
## Occupation 0 3564 3564 433.64 225.17 10 241
## Religiousity 0 3564 3564 2.42 0.87 1 2
## X1 0 3564 3564 40399.72 42241.45 3958 15224.75
## p50 p75 p100 hist
## 48 58 84 ▃▆▆▇▇▃▂▁
## 1968 1979 1996 ▁▃▅▇▇▆▆▂
## 4 6 9 ▁▂▇▂▁▁▂▁
## 1 2 2 ▇▁▁▁▁▁▁▆
## 3 4 5 ▃▆▁▇▁▇▁▇
## 5 6 10 ▂▁▂▇▂▂▁▁
## 2 3 4 ▇▁▆▁▁▅▁▅
## 1 4 4 ▇▁▁▁▁▁▁▃
## 410 532 960 ▂▇▇▃▆▂▅▂
## 3 3 4 ▃▁▆▁▁▇▁▂
## 43113.5 48222.25 470810 ▇▁▁▁▁▁▁▁
We might want to filter the data to only include the observations from Bulgaria, so we can use filter() to keep only those rows.
politics <- politics %>%
filter(Country == "Bulgaria")
This doesn't work, and the error message is a little obscure. I had to google it! Looking at those results, I saw one that mentioned filter(), so I checked it out. It seems like the problem is empty variable names, which we have in this dataset (recall the warning about the missing column name when we read the data in).
names(politics)
## [1] "X1" "Left_Right_Wing" "BirthYear"
## [4] "Income" "Religiousity" "Education"
## [7] "Country" "Location" "Gender"
## [10] "Occupation" "Marital_Status" "Age"
To fix this, we can rename that oddly-named first column. Now, our filter works!
politics <- politics %>%
  rename(obs = X1) # X1 is the placeholder name read_csv() gave the unnamed first column
skim(politics)
politics <- politics %>%
filter(Country == "Bulgaria")
We also might want to recode Left_Right_Wing to be a binary variable.
politics <- politics %>%
mutate(LRbinary = if_else(Left_Right_Wing>5, "right", "left"))
skim(politics)
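A quick sanity check on the new variable (a sketch; the counts depend on which rows you kept):
politics %>%
  count(LRbinary)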
Now that we've spent all this time getting our data into the right format, we want to write it to a file so that we never have to do this again! The opposite of read_csv() is write_csv(), and it does exactly that.
write_csv(policekillings, "policekillings.csv")
The file should then show up in your “Files” window, where you can “Export” it to your local computer if you like.
For your project, I’d like to see all the data-cleaning code you used in order to arrive at your final data, but it can get annoying running that code over and over again. So, saving your clean data can allow you to quickly read it in and start working. It’s also a good way to share data with your teammates.
Now, close your RStudio session, and this time choose Don't Save when it asks about saving your workspace image!