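# oilabs and mosaic supply mutate() and histogram(), which are used below
library(oilabs)
library(mosaic)
deaths <- read.csv("deaths.csv")
# Convert raw counts to per capita values so states of different sizes are comparable
deaths <- mutate(deaths,
                 mcd_percapita = mcd / pop,
                 colleges_percapita = colleges / pop,
                 athletes_percapita = athletes / pop)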
Abstract
The purpose of this study is to determine whether various causes of death are correlated with a set of state-level variables. By examining how these variables relate to different ways of dying, we can better understand how real-world factors are associated with death. This study examined the fifty states of the United States (US), as well as the District of Columbia (D.C.), using multiple linear regression models to test for statistically significant relationships (p<0.05) between the variables and death rates. The model for the total death rate showed that McDonald’s per capita, poverty rate, and colleges per capita are all significantly associated with the total death rate, while the variables significantly associated with individual causes of death varied.
Introduction
There are many different ways of dying. Some of the most prevalent causes of death are heart disease, cancer, stroke, vehicle accidents, and accidental death. In this study, we wanted to explore whether different variables, such as poverty rate, obesity rate, and population density, are related to the different causes of death. The research question is: are there any correlations between ways of dying and these variables?
In this observational study, we took death rate data for 11 different causes of death, along with data on 7 explanatory variables, for each state in the US and D.C. in 2013. We then created and analyzed regression models to look for any significant relationship (p<0.05) between the explanatory variables and the causes of death.
Data
Data was collected from the Centers for Disease Control and Prevention (CDC) on the number of deaths from different causes in the year 2013 for all 50 states as well as D.C. The total death rate (number of deaths/population for each state) and the percentage of deaths from different causes were the response variables, while the explanatory variables were population density, number of McDonald’s per capita, poverty rate, number of colleges per capita, high school graduation rate, obesity rate, and the number of 2014 Olympic athletes per capita. The explanatory variables were gathered separately from a number of sources (see references). The population of each state was controlled for by expressing the variables as rates or per capita values.
Results
Multiple Linear Regression Models:
deathrate <- lm(deathrate~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
alcohol <- lm(alcohol_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
murder <- lm(murder_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
heartdisease <- lm(heartdisease_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
accidents <- lm(accident_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
vehicle <- lm(vehicle_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
flu_pneumonia <- lm(flupneumonia_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
alzheimers <- lm(alzheimer_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
suicide <- lm(suicide_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
hiv <- lm(hiv_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
drugs <- lm(drug_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
diabetes <- lm(diabetes_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
Best fit multiple linear regression models using backwards elimination:
deathrate_bf <- lm(deathrate~pop_density + mcd_percapita + poverty_perc + colleges_percapita, data=deaths)
alcohol_bf <- lm(alcohol_perc~mcd_percapita, data=deaths)
murder_bf <- lm(murder_perc~pop_density + hsgrad_perc + obesity_perc, data=deaths)
heartdisease_bf <- lm(heartdisease_perc~pop_density, data=deaths)
accidents_bf <- lm(accident_perc~pop_density + mcd_percapita + colleges_percapita + hsgrad_perc, data=deaths)
vehicle_bf <- lm(vehicle_perc~pop_density + poverty_perc + athletes_percapita, data=deaths)
flu_pneumonia_bf <- lm(flupneumonia_perc~obesity_perc, data=deaths)
alzheimers_bf <- lm(alzheimer_perc~athletes_percapita, data=deaths)
suicide_bf <- lm(suicide_perc~pop_density + poverty_perc + hsgrad_perc, data=deaths)
hiv_bf <- lm(hiv_perc~pop_density + colleges_percapita + hsgrad_perc, data=deaths)
drugs_bf <- lm(drug_perc~poverty_perc + hsgrad_perc, data=deaths)
diabetes_bf <- lm(diabetes_perc~pop_density + mcd_percapita + poverty_perc, data=deaths)
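The best fit linear models usually had a higher adjusted R^2 value than their equivalent linear models accounting for all the variables (the unadjusted R^2 cannot increase when predictors are removed). This means the best fit models explain nearly as much of the variability while using fewer variables.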
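The selection procedure and the R^2 comparison described above can be sketched as follows. This is a minimal illustration on simulated data, not the procedure actually run for this report: the helper `backward_eliminate`, the data frame `toy`, and the 0.05 cutoff applied to coefficient p-values are all assumptions made for the example. Because the multiple R^2 of a reduced model can never exceed that of the full model, the adjusted R^2 is the quantity worth comparing.

```r
# Simulated stand-in for the state data: only x1 truly affects y
set.seed(1)
n <- 51
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + rnorm(n)

# Hypothetical helper: repeatedly drop the predictor with the largest p-value
# until every remaining predictor has p <= alpha (or only one is left)
backward_eliminate <- function(response, predictors, data, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)"]  # skip the intercept row
    names(pvals) <- predictors
    worst <- names(pvals)[which.max(pvals)]
    if (pvals[[worst]] <= alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, worst)
  }
}

full <- lm(y ~ x1 + x2 + x3, data = toy)
reduced <- backward_eliminate("y", c("x1", "x2", "x3"), toy)

# Side-by-side comparison of multiple and adjusted R^2
comparison <- rbind(
  full = c(r2 = summary(full)$r.squared, adj_r2 = summary(full)$adj.r.squared),
  reduced = c(r2 = summary(reduced)$r.squared, adj_r2 = summary(reduced)$adj.r.squared)
)
print(round(comparison, 4))
```

The adjusted R^2 rises only when the dropped predictors contribute less than the penalty for including them, which may be why the improvement appears for most but not all of the reduced models.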
The linear model for the total death rate is:
Total death rate = -2.13e-03 - 1.67e-07(population density) + 6.36e+01(McDonald’s per capita) + 1.97e-04(poverty rate) + 3.56e+01(colleges per capita) + 3.89e-05(high school graduation rate) + 5.25e-06(obesity rate) + 3.87e+01(Olympic athletes per capita)
This was tested with the data from Massachusetts to get:
Total death rate = -2.13e-03 - 1.67e-07(858) + 6.36e+01(4.43e-05) + 1.97e-04(11.6) + 3.56e+01(3.89e-05) + 3.89e-05(86) + 5.25e-06(28.2) + 3.87e+01(1.49e-07) = 0.0077
This is quite close to the value of 0.0081 found in the data.
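The Massachusetts calculation can be reproduced with a few lines of R. This sketch uses only the rounded coefficients and the Massachusetts values quoted above; the vector names `coefs` and `massachusetts` are chosen for illustration. With the fitted model object available, `predict()` with a one-row `newdata` data frame would give the same result from the unrounded coefficients.

```r
# Coefficients as rounded in the fitted equation above
coefs <- c(intercept = -2.13e-03, pop_density = -1.67e-07,
           mcd_percapita = 6.36e+01, poverty_perc = 1.97e-04,
           colleges_percapita = 3.56e+01, hsgrad_perc = 3.89e-05,
           obesity_perc = 5.25e-06, athletes_percapita = 3.87e+01)

# Massachusetts values quoted in the text; the leading 1 multiplies the intercept
massachusetts <- c(1, 858, 4.43e-05, 11.6, 3.89e-05, 86, 28.2, 1.49e-07)

predicted <- sum(coefs * massachusetts)
observed <- 0.0081

print(round(predicted, 4))             # 0.0077
print(round(observed - predicted, 4))  # residual of about 0.0004
```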
Total Death Rate Analysis
Checking Conditions for Multiple Linear Regression Model using L.I.N.E.
plot(deathrate)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
histogram(~deathrate, data=deaths, nint=15, fit="normal")
histogram(~residuals, data=deathrate, nint=15, fit="normal")
The residuals vs fitted plot suggests that the linearity, independence, and constant variance conditions are met. Although the fitted line curves up on the left side, that can be attributed to the small number of observations there, and the residuals are distributed roughly evenly above and below zero. To assess normality, we looked at the Q-Q plot. Since the points generally follow the line, with no curving at the beginning or end, we concluded that the residuals were approximately normal and not skewed. There were two outliers at the bottom, which we deemed insignificant. The histogram of the residuals shows that the residuals follow a normal curve, while the histogram of the death rate shows that the response itself is not normal, although it is not skewed.
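The visual normality check can be complemented with a numeric one. The sketch below is not part of the original analysis: it applies the Shapiro-Wilk test (`shapiro.test` from base R's stats package) to the residuals of a model fitted to simulated data, and the names `toy` and `fit` are placeholders. On the real data, the same call would be made on the residuals of the fitted total death rate model.

```r
# Simulated data with 51 observations, mirroring the 50 states plus D.C.
set.seed(42)
toy <- data.frame(x = runif(51))
toy$y <- 1 + 2 * toy$x + rnorm(51, sd = 0.5)

fit <- lm(y ~ x, data = toy)

# Shapiro-Wilk test on the residuals: a small p-value is evidence against normality
sw <- shapiro.test(resid(fit))
print(sw$p.value)
```

With only 51 observations the test has limited power, so it is best read alongside the Q-Q plot rather than instead of it.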
summary(deathrate)
##
## Call:
## lm(formula = deathrate ~ pop_density + mcd_percapita + poverty_perc +
## colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita,
## data = deaths)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -0.0022864 -0.0005340 -0.0001098  0.0005998  0.0017454
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -2.128e-03  2.649e-03  -0.803 0.426326
## pop_density        -1.671e-07  1.056e-07  -1.582 0.120921
## mcd_percapita       6.363e+01  1.863e+01   3.416 0.001400 **
## poverty_perc        1.965e-04  5.562e-05   3.533 0.000996 ***
## colleges_percapita  3.560e+01  1.704e+01   2.089 0.042651 *
## hsgrad_perc         3.888e-05  2.639e-05   1.473 0.147976
## obesity_perc        5.254e-06  4.700e-05   0.112 0.911514
## athletes_percapita  3.870e+01  3.950e+01   0.980 0.332705
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0009284 on 43 degrees of freedom
## Multiple R-squared: 0.5471, Adjusted R-squared: 0.4733
## F-statistic: 7.419 on 7 and 43 DF, p-value: 7.662e-06
Interpreting the Coefficients:
Holding all else constant, with every increase of 1 McDonald’s per capita, we would expect the total death rate to increase by 63.63. An increase of that size is far outside the range of the data (Massachusetts, for example, has 4.43e-05 McDonald’s per capita), so in a more practical sense, for every additional McDonald’s per 100,000 people we would expect the death rate to increase by 6.363*10^-4, or about 64 deaths per 100,000 people. Holding all else constant, with every increase of 1 percentage point in the share of the population living in poverty, we would expect the total death rate to increase by 1.965*10^-4. Lastly, holding all else constant, for every increase of one college per capita, we would expect the total death rate to increase by 35.6. Once again, in more practical terms, for every additional college per 100,000 people we would expect the death rate to increase by 3.56*10^-4.
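Because the per capita predictors take values on the order of 10^-5 (Massachusetts has 4.43e-05 McDonald’s per capita), it can help to rescale the coefficients to a change of one outlet or college per 100,000 residents. The sketch below does that arithmetic with the two coefficients from the summary output; the per-100,000 scale and the variable names are choices made for this illustration.

```r
# Coefficients copied from the summary output above
coef_per_capita <- c(mcd_percapita = 63.63, colleges_percapita = 35.60)

# Change in the predictor: one additional outlet or college per 100,000 residents
delta <- 1 / 100000

# Expected change in the total death rate (deaths per person)
change_in_rate <- coef_per_capita * delta

# The same change expressed as additional deaths per 100,000 residents
deaths_per_100k <- change_in_rate * 100000

print(change_in_rate)
print(deaths_per_100k)
```

Since the predictor and the response are both per capita quantities, rescaling both to per-100,000 units leaves the numeric coefficient unchanged: one more McDonald’s per 100,000 residents corresponds to about 64 more deaths per 100,000.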
Vehicle Accident Analysis
plot(vehicle)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
histogram(~vehicle_perc, data=deaths, nint=15, fit="normal")
histogram(~residuals, data=vehicle, nint=15, fit="normal")
Checking Conditions for Multiple Linear Regression Model using L.I.N.E.
Looking at the residuals vs fitted plot, the independence, constant variance, and linearity conditions seem to be met. However, there is an outlier on the far left, the District of Columbia, whose proportion of deaths from vehicle accidents is roughly one-tenth that of the other states. In the normal Q-Q plot, there are many outliers on the right side, indicating that the normality condition is not met. The histograms confirm this: both the response variable and the residuals appear to be right skewed.
summary(vehicle)
##
## Call:
## lm(formula = vehicle_perc ~ pop_density + mcd_percapita + poverty_perc +
## colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita,
## data = deaths)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -0.0062033 -0.0019941 -0.0002799  0.0015041  0.0099007
##
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)
## (Intercept)         5.781e-03  1.000e-02   0.578  0.56636
## pop_density        -1.013e-06  3.988e-07  -2.541  0.01476 *
## mcd_percapita      -3.858e+01  7.035e+01  -0.548  0.58622
## poverty_perc        5.836e-04  2.100e-04   2.778  0.00806 **
## colleges_percapita  1.721e+01  6.436e+01   0.267  0.79038
## hsgrad_perc        -4.120e-05  9.967e-05  -0.413  0.68140
## obesity_perc        1.566e-04  1.775e-04   0.882  0.38251
## athletes_percapita  2.840e+02  1.492e+02   1.904  0.06363 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.003506 on 43 degrees of freedom
## Multiple R-squared: 0.3897, Adjusted R-squared: 0.2903
## F-statistic: 3.922 on 7 and 43 DF, p-value: 0.00217
Interpreting the coefficients:
Holding all else constant, with every increase of 1 person per square mile in population density, we would expect the proportion of deaths from vehicle accidents to decrease by 1.013*10^-6. Likewise, holding all else constant, for every increase of 1 percentage point in the share of the population living in poverty, we would expect the proportion of deaths from vehicle accidents to increase by 5.836*10^-4.
Checking for correlation between explanatory variables:
pairs(~pop_density+mcd_percapita+poverty_perc+colleges_percapita+hsgrad_perc+obesity_perc+athletes_percapita,data=deaths,main="Explanatory Variables")
Poverty rate and high school graduation rate appear to be negatively correlated, and obesity rate and McDonald’s per capita look slightly positively correlated. Because each coefficient is estimated while holding the other variables constant, we do not think these correlations substantially change the results, although correlated explanatory variables can inflate the standard errors of the affected coefficients.
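The visual impression from the scatterplot matrix can be backed up with numbers. The sketch below is an illustration on simulated data, not a result from this study: it prints a correlation matrix with `cor()` and computes variance inflation factors by hand as 1/(1 - R^2), where R^2 comes from regressing each explanatory variable on the others. The data frame `toy` and its columns are placeholders that loosely mimic a negatively correlated poverty and graduation rate.

```r
# Simulated explanatory variables; hsgrad is built to be negatively correlated with poverty
set.seed(7)
n <- 51
toy <- data.frame(poverty = rnorm(n))
toy$hsgrad <- -0.7 * toy$poverty + rnorm(n, sd = 0.5)
toy$density <- rnorm(n)

# Pairwise correlations between the explanatory variables
print(round(cor(toy), 2))

# Variance inflation factor for each variable: 1 / (1 - R^2) from regressing it on the rest
vif <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif, 2))
```

A variance inflation factor near 1 indicates little collinearity, while larger values mean the corresponding coefficient's standard error is inflated by overlap with the other variables.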
Conclusion
The goal of this project was to determine the correlations, if any, between the different death rates and seven variables (number of Olympic athletes per capita, obesity rate, high school graduation rate, colleges per capita, poverty rate, McDonald’s per capita, and population density). The multiple linear regression model for the vehicle death rate, controlling for all other variables, showed that population density and poverty rate were significant (p<0.05). The corresponding model for the total death rate showed that McDonald’s per capita, poverty rate, and colleges per capita were significant (p<0.05).
Although correlations were found between the different causes of death and the variables that were chosen, causal relationships cannot be determined. For each correlation, there could be confounding variables that were not accounted for. For example, the number of people in each state who take public transportation as opposed to driving could be a confounding variable for the vehicle death rate. When discussing the implications of this research, it is also important to take into account that although the correlations are statistically significant, not all of them are practically significant. This is because some of the coefficients are so small that they would have little to no effect in the real world. Another limitation is the ecological fallacy: our findings cannot be applied to individuals because our data was collected from states as opposed to individuals.
Additionally, because our models do not satisfy all of the conditions of multiple linear regression, we cannot definitively recommend them for cases outside the scope of the analyzed data. Another reason is that the sample was confined to fifty states and D.C. for a single year. To improve the models, we could add variables and expand our sample size by including data from multiple years.
There are also concerns with the data itself. Data on causes of death relies largely on reports from doctors and hospitals. In underfunded hospitals or health centers, causes of death may be misreported. Causes of death can also be unintentionally misreported; for example, a death from suicide may be reported as an alcohol-related death. A patient with little history of mental health issues would not prompt doctors or other individuals to look further into the death, and there is no way to make a solid differentiation post mortem.
References
“Leading Causes of Death.” Centers for Disease Control and Prevention. September 2015. Accessed November 2015. http://www.cdc.gov/nchs/fastats/leading-causes-of-death.html
“Fishing, Logging, Flying an Airplane: Here Are America’s Deadliest Jobs.” Thompson, Derek. January 25, 2013. Accessed November 2015. http://www.theatlantic.com/business/archive/2013/01/fishing-logging-flying-an-airplane-here-are-americas-deadliest-jobs/272542/
“The 25 Most Common Causes of Death.” Jagger, Chris. Accessed November 2015. http://www.medhelp.org/general-health/articles/The-25-Most-Common-Causes-of-Death/193
“Here are the 11 Most McDonald’s-Heavy States in America.” Theeboom, Sarah. June 5, 2014. Accessed November 2015. http://firstwefeast.com/eat/states-with-the-most-mcdonalds/
“Annual Death Rate by State.” Accessed November 2015. http://www.statisticbrain.com/death-rate-by-state/
“Number of Deaths per 100,000 Population.” Accessed November 2015. http://kff.org/other/state-indicator/death-rate-per-100000/
“Colleges and Universities in the United States of America by State/Possession.” August 12, 2013. Accessed November 2015. http://www.univsearch.com/state.php
“State by State Data.” Accessed November 2015. http://ticas.org/posd/map-state-data-2015
“Nutrition, Physical Activity and Obesity: Data, Trends, and Maps.” Centers for Disease Control and Prevention. Accessed November 2015. http://nccd.cdc.gov/NPAO_DTM/LocationSummary.aspx?state=California