library(tidyverse)
library(Stat2Data)
library(skimr)
The book uses the FirstYearGPA data set to illustrate automated procedures for variable selection in multiple regression. Since the book uses Minitab, we will work through some examples of how to do this in R.
First, we have to load the data.
data(FirstYearGPA)
head(FirstYearGPA)
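Since we loaded the skimr package above, we can also get a quick summary of every variable before modeling. [This step is optional; summary() would work just as well.]

skim(FirstYearGPA)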
Keep in mind that there is no such thing as a “best” multiple regression model. Each model has its strengths and weaknesses, and it is your job as the data analyst to construct a model that is effective for your purposes. That could mean forgoing a model with a higher \(R^2\) for one that is conceptually simpler. Or it could mean adding polynomial terms that explain only a little bit of extra variation in the response.
With this caveat in mind, there is no way to fully automate a procedure that will find the best regression model. Instead, there are several algorithms for optimizing a given criterion. There are at least four common procedures: best subsets, backwards elimination, forward selection, and stepwise regression.
We will discuss each. Furthermore, each of these techniques can optimize a different criterion. There is no universally agreed-upon best criterion, but popular choices include \(R^2\), adjusted \(R^2\), Mallow’s \(C_p\), AIC, and BIC.
While the book focuses on Mallow’s \(C_p\), we will focus on AIC, since that is what R’s built-in step() function uses by default.
The most straightforward method for variable selection is simply to try all of the subsets. Unfortunately, if the number of predictors is \(k\), then you have to try all \(2^k\) subsets! Here, with \(k = 9\) candidate predictors, that is already \(2^9 = 512\) models. Thus, the number of subsets grows exponentially in the number of predictors. [Computer scientists refer to such algorithms as having exponential running times, which are considered extremely inefficient.]
The following code performs best subsets regression.
# install.packages("leaps")
library(leaps)
# Reports the two best models for each number of predictors
best <- regsubsets(GPA ~ ., data=FirstYearGPA, nbest=2, method="exhaustive")
with(summary(best), data.frame(rsq, adjr2, cp, rss, outmat))
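For example, to identify the single model that minimizes Mallow’s \(C_p\), we can work with the summary object directly (a quick sketch using the cp and outmat components that summary() returns for a regsubsets object):

best_summary <- summary(best)
# Row of the outcome matrix corresponding to the smallest Cp
best_summary$outmat[which.min(best_summary$cp), ]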
Note that regsubsets() does not report AIC. To compute the AIC of a fitted model, run the AIC() command on your model.
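For example (a minimal sketch; the two predictors here are chosen just for illustration):

# Fit one candidate model and compute its AIC
mod <- lm(GPA ~ HSGPA + SATV, data = FirstYearGPA)
AIC(mod)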
Backwards Elimination is a simple algorithm that begins by throwing all of the terms into the model, and then greedily removes the ones that are least statistically significant.
backward <- regsubsets(GPA ~ ., data=FirstYearGPA, nbest = 1, nvmax = 6, method = "backward")
summary(backward)
with(summary(backward), data.frame(cp, outmat))
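Note that regsubsets() ranks models by criteria such as \(C_p\); if you want backwards elimination driven by AIC, base R’s step() function provides it (a sketch, not the book’s Minitab workflow):

# Start from the full model and let step() greedily drop terms to lower the AIC
full <- lm(GPA ~ ., data = FirstYearGPA)
step(full, direction = "backward")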
Forward Selection works in the opposite direction from Backwards Elimination. Here, we begin with an empty model and then greedily add terms that have a statistically significant effect.
forward <- regsubsets(GPA ~ ., data=FirstYearGPA, nbest = 1, nvmax = 6, method = "forward")
with(summary(forward), data.frame(cp, outmat))
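The AIC-driven analogue starts from the intercept-only model and lets step() add terms (a sketch; the scope argument tells step() which terms it is allowed to add):

# Start from the empty model and let step() greedily add terms that lower the AIC
null <- lm(GPA ~ 1, data = FirstYearGPA)
full <- lm(GPA ~ ., data = FirstYearGPA)
step(null, scope = formula(full), direction = "forward")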
Stepwise regression combines the ideas of Backwards Elimination and Forward Selection to move in both directions.
stepwise <- regsubsets(GPA ~ ., data=FirstYearGPA, nbest = 1, nvmax = 6, method = "seqrep")
with(summary(stepwise), data.frame(cp, outmat))
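Again, for an AIC-driven version, step() with direction = "both" will consider both adding and dropping a term at every stage (a sketch):

# Stepwise search guided by AIC, starting from the full model
full <- lm(GPA ~ ., data = FirstYearGPA)
step(full, direction = "both")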
Each of these methods will come up with slightly different models. As the analyst, your job is to choose the one you think is best, based on your contextual knowledge and your understanding of the algorithms.