library(fivethirtyeight)
library(tidyverse)Case Study: Friday the 13th Effect
Task
Reproduce this figure from fivethirtyeight’s article Some People Are Too Superstitious To Have A Baby On Friday The 13th:

Data
In the fivethiryeight package there are two datasets containing birth data, but for now let’s just work with one, US_births_1994_2003. Note that since we have data from 1994-2003, our results may differ somewhat from the figure based on 1994-2014.
Your Turn 1
With your neighbour, brainstorm the steps needed to get the data in a form ready to make the plot.
US_births_1994_2003Some overviews of the data
Whole time series:
ggplot(US_births_1994_2003, aes(x = date, y = births)) +
geom_line()There is so much fluctuation it’s really hard to see what is going on.
Let’s try just looking at one year:
US_births_1994_2003 %>%
filter(year == 1994) %>%
ggplot(mapping = aes(x = date, y = births)) +
geom_line()Strong weekly pattern accounts for most variation.
Strategy
Use the figure as a guide for what the data should like to make the final plot. We want to end up with something like:
| day_of_week | avg_diff_13 |
|---|---|
| Mon | -2.686 |
| Tues | -1.378 |
| Wed | -3.274 |
| … | … |
Your Turn 2
Extract just the 6th, 13th and 20th of each month:
US_births_1994_2003 %>%
select(-date) Your Turn 3
Which arrangement is tidy?
Option 1:
| year | month | date_of_month | day_of_week | births |
|---|---|---|---|---|
| 1994 | 1 | 6 | Thurs | 11406 |
| 1994 | 1 | 13 | Thurs | 11212 |
| 1994 | 1 | 20 | Thurs | 11682 |
Option 2:
| year | month | day_of_week | 6 | 13 | 20 |
|---|---|---|---|---|---|
| 1994 | 1 | Thurs | 11406 | 11212 | 11682 |
(Hint: think about our next step “Find the percent difference between the 13th and the average of the 6th and 12th”. In which layout will this be easier using our tidy tools?)
Your Turn 4
Tidy the filtered data to have the days in columns.
US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20))Your Turn 5
Now use mutate() to add columns for:
- The average of the births on the 6th and 20th
- The percentage difference between the number of births on the 13th and the average of the 6th and 20th
US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) A little additional exploring
Now we have a percent difference between the 13th and the 6th and 20th of each month, it’s probably worth exploring a little (at the very least to check our calculations seem reasonable).
To make it a little easier let’s assign our current data to a variable
births_diff_13 <- US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) %>%
mutate(
avg_6_20 = (`6` + `20`)/2,
diff_13 = (`13` - avg_6_20) / avg_6_20 * 100
)Then take a look
births_diff_13 %>%
ggplot(mapping = aes(day_of_week, diff_13)) +
geom_point()Looks like we are on the right path. There’s a big outlier one Monday
births_diff_13 %>%
filter(day_of_week == "Mon", diff_13 > 10)Seem’s to be driven but a particularly low number of births on the 6th of Sep 1999. Maybe a holiday effect? Labour Day was of the 6th of Sep that year.
Your Turn 6
Summarize each day of the week to have mean of diff_13.
Then, recreate the fivethirtyeight plot.
US_births_1994_2003 %>%
select(-date) %>%
filter(date_of_month %in% c(6, 13, 20)) %>%
spread(date_of_month, births) %>%
mutate(
avg_6_20 = (`6` + `20`)/2,
diff_13 = (`13` - avg_6_20) / avg_6_20 * 100
) Extra Challenges
If you wanted to use the
US_births_2000_2014data instead, what would you need to change in the pipeline? How about using bothUS_births_1994_2003andUS_births_2000_2014?Try not removing the
datecolumn. At what point in the pipeline does it cause problems? Why?Can you come up with an alternative way to investigate the Friday the 13th effect? Try it out!
Takeaways
The power of the tidyverse comes from being able to easily combine functions that do simple things well.