library(fivethirtyeight)
library(tidyverse)
Case Study: Friday the 13th Effect
Task
Reproduce this figure from fivethirtyeight’s article Some People Are Too Superstitious To Have A Baby On Friday The 13th:
Data
In the fivethiryeight
package there are two datasets containing birth data, but for now let’s just work with one, US_births_1994_2003
. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014.
Your Turn 1
With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot.
US_births_1994_2003
Some overviews of the data
Whole time series:
ggplot(US_births_1994_2003, aes(x = date, y = births)) +
geom_line()
There is so much fluctuation it’s really hard to see what is going on.
Let’s try just looking at one year:
%>%
US_births_1994_2003 filter(year == 1994) %>%
ggplot(mapping = aes(x = date, y = births)) +
geom_line()
Strong weekly pattern accounts for most variation.
Strategy
Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like:
day_of_week | avg_diff_13 |
---|---|
Mon | -2.686 |
Tues | -1.378 |
Wed | -3.274 |
… | … |
Your Turn 2
Extract just the 6th, 13th and 20th of each month:
%>%
US_births_1994_2003 select(-date)
Your Turn 3
Which arrangement is tidy?
Option 1:
year | month | date_of_month | day_of_week | births |
---|---|---|---|---|
1994 | 1 | 6 | Thurs | 11406 |
1994 | 1 | 13 | Thurs | 11212 |
1994 | 1 | 20 | Thurs | 11682 |
Option 2:
year | month | day_of_week | 6 | 13 | 20 |
---|---|---|---|---|---|
1994 | 1 | Thurs | 11406 | 11212 | 11682 |
(Hint: think about our next step “Find the percent difference between the 13th and the average of the 6th and 12th”. In which layout will this be easier using our tidy tools?)
Your Turn 4
Tidy the filtered data to have the days in columns.
%>%
US_births_1994_2003 select(-date) %>%
filter(date_of_month %in% c(6, 13, 20))
Your Turn 5
Now use mutate()
to add columns for:
- The average of the births on the 6th and 20th
- The percentage difference between the number of births on the 13th and the average of the 6th and 20th
%>%
US_births_1994_2003 select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births)
A little additional exploring
Now we have a percent difference between the 13th and the 6th and 20th of each month, it’s probably worth exploring a little (at the very least to check our calculations seem reasonable).
To make it a little easier let’s assign our current data to a variable
<- US_births_1994_2003 %>%
births_diff_13 select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) %>%
mutate(
avg_6_20 = (`6` + `20`)/2,
diff_13 = (`13` - avg_6_20) / avg_6_20 * 100
)
Then take a look
%>%
births_diff_13 ggplot(mapping = aes(day_of_week, diff_13)) +
geom_point()
Looks like we are on the right path. There’s a big outlier one Monday
%>%
births_diff_13 filter(day_of_week == "Mon", diff_13 > 10)
Seem’s to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year.
Your Turn 6
Summarize each day of the week to have mean of diff_13.
Then, recreate the fivethirtyeight plot.
%>%
US_births_1994_2003 select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) %>%
mutate(
avg_6_20 = (`6` + `20`)/2,
diff_13 = (`13` - avg_6_20) / avg_6_20 * 100
)
Extra Challenges
If you wanted to use the
US_births_2000_2014
data instead, what would you need to change in the pipeline? How about using bothUS_births_1994_2003
andUS_births_2000_2014
?Try not removing the
date
column. At what point in the pipeline does it cause problems? Why?Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out!
Takeaways
The power of the tidyverse comes from being able to easily combine functions that do simple things well.