library(oilabs)
library(mosaic)
# mutate() below comes from dplyr
deaths <- read.csv("deaths.csv")
# Convert raw counts to per capita values so states of different sizes are comparable
deaths <- mutate(deaths,
                 mcd_percapita = mcd / pop,
                 colleges_percapita = colleges / pop,
                 athletes_percapita = athletes / pop)

Abstract

The purpose of this study is to determine whether death rates from various causes are associated with a set of state-level demographic and socioeconomic variables. By examining how these variables relate to different causes of death, we can better understand how real-world factors contribute to death. The study covered all fifty states in the United States (US), as well as the District of Columbia (D.C.), using multiple linear regression models to test for statistically significant relationships between the variables and death rates. The models showed that McDonald’s per capita, poverty rate, and colleges per capita all have a significant association with total death rate (p<0.05), while the variables significantly associated with individual causes of death varied from cause to cause.

Introduction

There are many different ways of dying. Some of the most prevalent causes of death are heart disease, cancer, stroke, vehicle accidents, and accidental death. In this study, we wanted to explore whether different variables, such as poverty rate, obesity rate, and population density, are related to the different causes of death. The research question is: are there any correlations between causes of death and these variables?

In this observational study, we took 2013 death rate data for 11 different causes of death, along with data on 7 explanatory variables, for each state in the US as well as D.C. We then fitted and analyzed regression models to look for significant relationships (p<0.05) between the explanatory variables and the causes of death.

Data

Data were collected from the Centers for Disease Control and Prevention (CDC) on the number of deaths from different causes in 2013 for all 50 states as well as D.C. The total death rate (number of deaths divided by population for each state) and the percentage of deaths from each cause were the response variables, while the explanatory variables were population density, number of McDonald’s per capita, poverty rate, number of colleges per capita, high school graduation rate, obesity rate, and the number of 2014 Olympic athletes per capita. The explanatory variables were gathered separately from a number of sources (see references). State population was controlled for by expressing the variables as rates or per capita values.

Results

Multiple Linear Regression Models:

deathrate <- lm(deathrate~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

alcohol <- lm(alcohol_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

murder <- lm(murder_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

heartdisease <- lm(heartdisease_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

accidents <- lm(accident_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

vehicle <- lm(vehicle_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

flu_pneumonia <- lm(flupneumonia_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

alzheimers <- lm(alzheimer_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

suicide <- lm(suicide_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

hiv <- lm(hiv_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

drugs <- lm(drug_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)

diabetes <- lm(diabetes_perc~pop_density + mcd_percapita + poverty_perc + colleges_percapita + hsgrad_perc + obesity_perc + athletes_percapita, data=deaths)
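
The twelve full models share one set of seven predictors, so they could also be fitted in a loop. The sketch below is one possible way to do that, not the code used in this study: `fit_full_models` is a hypothetical helper name, and the small data frame is simulated only so that the example runs on its own.

```r
# Predictor names as used in the full models
predictors <- c("pop_density", "mcd_percapita", "poverty_perc",
                "colleges_percapita", "hsgrad_perc", "obesity_perc",
                "athletes_percapita")

# Hypothetical helper: fit one full model per response column
fit_full_models <- function(data, responses, predictors) {
  models <- lapply(responses, function(resp) {
    lm(reformulate(predictors, response = resp), data = data)
  })
  names(models) <- responses
  models
}

# Simulated stand-in for `deaths` (51 rows of random values; NOT the study data)
set.seed(1)
toy <- as.data.frame(matrix(runif(51 * 9), nrow = 51))
names(toy) <- c(predictors, "deathrate", "alcohol_perc")

full_models <- fit_full_models(toy, c("deathrate", "alcohol_perc"), predictors)
formula(full_models$deathrate)
```

With the real data, the same call would take `deaths` and the full list of response columns, and each fitted model could then be inspected with `summary()`.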

Best fit multiple linear regression models using backwards elimination:

deathrate_bf <- lm(deathrate~pop_density + mcd_percapita + poverty_perc + colleges_percapita, data=deaths)

alcohol_bf <- lm(alcohol_perc~ mcd_percapita, data=deaths)

murder_bf <- lm(murder_perc~pop_density + hsgrad_perc + obesity_perc, data=deaths)

heartdisease_bf <- lm(heartdisease_perc~pop_density, data=deaths)

accidents_bf <- lm(accident_perc~pop_density + mcd_percapita + colleges_percapita + hsgrad_perc, data=deaths)

vehicle_bf <- lm(vehicle_perc~pop_density + poverty_perc + athletes_percapita, data=deaths)

flu_pneumonia_bf <- lm(flupneumonia_perc~obesity_perc, data=deaths)

alzheimers_bf <- lm(alzheimer_perc~athletes_percapita, data=deaths)

suicide_bf <- lm(suicide_perc~pop_density + poverty_perc + hsgrad_perc, data=deaths)

hiv_bf <- lm(hiv_perc~pop_density + colleges_percapita + hsgrad_perc, data=deaths)

drugs_bf <- lm(drug_perc~poverty_perc + hsgrad_perc, data=deaths)

diabetes_bf <- lm(diabetes_perc~pop_density + mcd_percapita + poverty_perc, data=deaths)
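
The report does not state which criterion drove the backwards elimination. As a minimal sketch, assuming a p-value criterion with the same 0.05 threshold used elsewhere in this study, the procedure could look like the following. The helper name `backward_eliminate` and the simulated data are hypothetical and serve only to make the example runnable.

```r
# Hypothetical helper: refit, drop the least significant predictor, and repeat
# until every remaining predictor has p < alpha. Assumes numeric predictors,
# so coefficient names match column names. If only one predictor is left,
# it is kept even when it is not significant.
backward_eliminate <- function(data, response, predictors, alpha = 0.05) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    # p-values of the slope terms (row 1 is the intercept)
    pvals <- summary(fit)$coefficients[-1, "Pr(>|t|)", drop = FALSE]
    worst <- which.max(pvals[, 1])
    if (pvals[worst, 1] < alpha || length(predictors) == 1) return(fit)
    predictors <- setdiff(predictors, rownames(pvals)[worst])
  }
}

# Simulated data (NOT the study data): only two predictors truly matter
set.seed(42)
n <- 51
toy <- data.frame(pop_density = rnorm(n), poverty_perc = rnorm(n),
                  obesity_perc = rnorm(n), hsgrad_perc = rnorm(n))
toy$deathrate <- 2 * toy$pop_density - 3 * toy$poverty_perc + rnorm(n, sd = 0.5)

best <- backward_eliminate(toy, "deathrate",
                           c("pop_density", "poverty_perc",
                             "obesity_perc", "hsgrad_perc"))
names(coef(best))
```

Other criteria, such as adjusted R^2 or AIC (for example via `step()`), can select a different set of predictors, so results from this sketch need not match the best fit models listed here.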

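The best fit linear models usually had a higher adjusted R^2 value than the corresponding full models containing all seven variables. Since removing predictors can never increase the ordinary R^2, the adjusted value is the relevant comparison: a higher adjusted R^2 means the reduced model explains nearly as much of the variability with fewer predictors.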
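
To make the comparison between a full model and its best fit counterpart concrete, both R^2 values can be read from `summary()`. The sketch below uses a hypothetical helper, `compare_fit`, and simulated data rather than the study data.

```r
# Hypothetical helper: side-by-side fit statistics for a full and a reduced model
compare_fit <- function(full, reduced) {
  data.frame(
    model  = c("full", "reduced"),
    r2     = c(summary(full)$r.squared, summary(reduced)$r.squared),
    adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared)
  )
}

# Simulated data (NOT the study data): `noise` is unrelated to the response
set.seed(7)
toy <- data.frame(x1 = rnorm(51), x2 = rnorm(51), noise = rnorm(51))
toy$y <- 1.5 * toy$x1 - toy$x2 + rnorm(51)

full    <- lm(y ~ x1 + x2 + noise, data = toy)
reduced <- lm(y ~ x1 + x2, data = toy)
compare_fit(full, reduced)
```

With the models from this study, the equivalent call would be `compare_fit(deathrate, deathrate_bf)`. The `r2` column is never larger for the reduced model, while `adj_r2` can go either way.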

The linear model for the total death rate is:

Total death rate = -2.13e-03 - 1.67e-07(population density) + 6.36e01(McD per capita) + 1.97e-04(poverty rate) + 3.56e01(colleges per capita) + 3.89e-05(high school graduation rate) + 5.25e-06(obesity rate) + 3.87e01(Olympic athletes per capita)

This was tested with the data from Massachusetts to get:

Total death rate = -2.13e-03 - 1.67e-07(858) + 6.36e01(4.43e-05) + 1.97e-04(11.6) + 3.56e01(3.89e-05) + 3.89e-05(86) + 5.25e-06(28.2) + 3.87e01(1.49e-07) = 0.0077

This is quite close to the value of 0.0081 found in the data.
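
The Massachusetts calculation can be reproduced directly from the reported coefficients. The sketch below only re-does the arithmetic with the rounded values printed in this section; the vector names are chosen here for illustration.

```r
# Coefficients as reported in the fitted equation (intercept first)
coefs <- c(intercept = -2.13e-03, pop_density = -1.67e-07,
           mcd_percapita = 6.36e01, poverty_perc = 1.97e-04,
           colleges_percapita = 3.56e01, hsgrad_perc = 3.89e-05,
           obesity_perc = 5.25e-06, athletes_percapita = 3.87e01)

# Massachusetts values as substituted in the text; the leading 1 multiplies the intercept
massachusetts <- c(1, 858, 4.43e-05, 11.6, 3.89e-05, 86, 28.2, 1.49e-07)

predicted <- sum(coefs * massachusetts)
round(predicted, 4)  # 0.0077
```

Calling `predict()` on the fitted `deathrate` model with the Massachusetts row as `newdata` would give the same prediction from the unrounded coefficients.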

Total Death Rate Analysis

Checking Conditions for Multiple Linear Regression Model using L.I.N.E. (Linearity, Independence, Normality of residuals, Equal variance)

plot(deathrate)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
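
These `NaNs produced` warnings are consistent with the residuals-versus-leverage panel of `plot()` encountering an observation whose leverage is very close to 1. One way to check for such a point is to inspect the hat values directly. The sketch below uses simulated data with one deliberately extreme row (not the study data), and the cutoff of twice the average leverage is a common rule of thumb rather than something taken from this report.

```r
# Simulated data (NOT the study data): the last row has an extreme predictor value
set.seed(3)
toy <- data.frame(x = c(rnorm(50), 40))
toy$y <- 0.5 * toy$x + rnorm(51)
fit <- lm(y ~ x, data = toy)

# Leverage (hat value) of each observation; values near 1 dominate the fit
lev <- hatvalues(fit)

# Rule of thumb: flag leverage above twice the average, 2 * p / n
p <- length(coef(fit))
flagged <- which(lev > 2 * p / nrow(toy))
lev[flagged]
```

Applied to this study, `hatvalues(deathrate)` would show whether any single state or D.C. has leverage near 1, which is worth knowing before interpreting the diagnostic plots.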