library(tidyverse)
library(Stat2Data)
library(skimr)
To this point in the course we have learned how to build models for quantitative response variables as functions of quantiative explanatory variables, and binary categorical variables. Today we are going to learn how to build models for quantitative response variables as a function of a single categorical variable. These techinques are known (somewhat confusingly) as ANOVA.
Let’s follow the example from the book about fruit flies. Here, researchers have conducted an experiment to measure the Longevity
of males fruit flies (in days). 25 male fruit flies were assigned to live in five different environments (Treatment
). Our goal is to understand how the average lifespan is related to the particular treatment. To do this we will use an ANOVA model.
data(FruitFlies)
head(FruitFlies)
Before we do anything else, we want to pay attention to the treatment levels. The Treatment
variable has the type factor in R. There are five different kinds of treatments:
FruitFlies %>%
count(Treatment)
In this case, the factor level none corresponds to the control group, so we want to tell R to use that as the reference level, since by default, it uses the first label in alphabetical order.
# Set the reference level
FruitFlies <- FruitFlies %>%
mutate(Treatment = relevel(Treatment, ref="none"))
You can check the count()
again to see what has happened to the data.
Next, let’s do some basic exploratory data analysis. We can examine the means and standard deviations among the different treatments graphically
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity))
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity)) + facet_wrap(~Treatment)
ggplot(data=FruitFlies) + geom_boxplot(aes(y=Longevity, x=1))
ggplot(data=FruitFlies) + geom_boxplot(aes(y=Longevity, x=Treatment))
and in a table:
FruitFlies %>%
group_by(Treatment) %>%
skim(Longevity)
The simplest model we could build for Longevity
would have no variable at all – just the average value of the response variable. This model is sometimes called the grand mean model, since the only term in the model is the mean of the response variable.
fm_null <- lm(Longevity ~ 1, data=FruitFlies)
summary(fm_null)
FruitFlies %>%
skim(Longevity)
fitted <- fitted.values(fm_null)
One way to visualize it is with a dot plot, and add the “regression line.”
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity)) + geom_vline(xintercept = coef(fm_null), color="red")
Of course we’d like to construct a better model by having separate estimates for each of the different groups. But we want to understand the differences between groups in the context of the grand mean. To do this we build a model of the form \[ Y = \mu + \alpha_1 + \cdots + \alpha_K + \epsilon, \] for \(k=1,...,K\). As usual, we require that \(\epsilon \sim N(0, \sigma_\epsilon)\).
This is straightfoward in R:
fm_aov <- aov(Longevity ~ Treatment, data=FruitFlies)
summary(fm_aov)
model.tables(fm_aov)
Note that this model is equivalent, up to some algebra, of the one produced by lm
, which has the form: \[
Y = \beta_0 + \beta_1 + \cdots + \beta_{K-1} + \epsilon',
\] for \(k=1,...,K-1\). As usual, we require that \(\epsilon' \sim N(0, \sigma_\epsilon')\). Note that \(\beta_0 = \mu + \alpha_1\), and \(\beta_{k} = \alpha_{k+1} - \alpha_1\), for \(k = 1,\ldots,K-1\).
fm_ref <- lm(Longevity ~ Treatment, data=FruitFlies)
summary(fm_ref)
anova(fm_ref)
FruitFlies <- FruitFlies %>%
mutate(grandmean = mean(Longevity)) %>%
group_by(Treatment) %>%
mutate(groupmean = mean(Longevity), effect = groupmean - grandmean)
FruitFlies
effects <- FruitFlies %>%
group_by(Treatment) %>%
summarize(effect = mean(effect))
We can plot our “model” by adding lines to each panel.
ggplot(data=FruitFlies) + geom_dotplot(aes(x=Longevity)) + geom_vline(xintercept = effects$effect, color="red") + facet_wrap(~Treatment)
We still need to check whether the conditions for ANOVA were met.
plot(fm_aov)
FruitFlies %>%
group_by(Treatment) %>%
skim(Longevity)
Weight
as a function of Species
.data(Hawks)