require(mosaic)
require(car)
Researchers observed the following data on 20 individuals with high blood pressure:

- BP (blood pressure, in mm Hg)
- Age (in years)
- Weight (in kg)
- BSA (body surface area, in m^2)
- Dur (duration of hypertension, in years)
- Pulse (in beats per minute)
- Stress (stress index)

Our goal is to build a model for blood pressure as a function of (some subset of) the other variables. In this case all of our variables are quantitative, so we can get a quick look at their relationships using a pairs plot.
BP <- read.csv("http://www.math.smith.edu/~bbaumer/mth247/labs/bloodpress.csv")
pairs(BP)
# Better than the standard pairs plot is the generalized pairs plot.
# install.packages("gpairs")
# require(gpairs)
# gpairs(BP)
Which of the explanatory variables appears to be most strongly correlated with BP? Hint: use cor to calculate the correlation coefficient matrix.
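One way to act on that hint is to pull out just the column of correlations with BP and sort it by strength of association (a sketch; the abs() is there because a strong negative correlation would matter just as much):
# Correlation of each variable with BP, sorted by absolute value
# (BP itself will appear at the top with correlation 1)
sort(abs(cor(BP)[, "BP"]), decreasing = TRUE)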
Weight seems to be highly correlated with BP, so as a first step, we should understand how well a simple linear model for blood pressure as a function of weight works. Keep in mind that it accords with our intuition that there would be a strong link between a person’s weight and their blood pressure.
xyplot(BP ~ Weight, data=BP, pch=19, alpha=0.5, cex=1.5, type=c("p", "r"))
m1 <- lm(BP ~ Weight, data=BP)
summary(m1)
##
## Call:
## lm(formula = BP ~ Weight, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6933 -0.9318 -0.4935 0.7703 4.8656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.20531 8.66333 0.255 0.802
## Weight 1.20093 0.09297 12.917 1.53e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 18 degrees of freedom
## Multiple R-squared: 0.9026, Adjusted R-squared: 0.8972
## F-statistic: 166.9 on 1 and 18 DF, p-value: 1.528e-10
plot(m1, which=c(1,2))
The SLR model for blood pressure is pretty effective, but there is one person who generates a large residual. We want to use the additional data that we have to build a better model for blood pressure.
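As an aside, one quick way to pinpoint which observation generates that large residual (a sketch using base R's resid()):
# Index of the observation with the largest absolute residual from the SLR fit
which.max(abs(resid(m1)))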
Without any intuition, one way to proceed is to simply throw all of the variables into our regression model. In a sense, we are throwing everything but the kitchen sink into the model.
# Using the . in the formula interface includes all non-response variables in the data frame
mfull <- lm(BP ~ ., data=BP)
summary(mfull)
##
## Call:
## lm(formula = BP ~ ., data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93213 -0.11314 0.03064 0.21834 0.48454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.870476 2.556650 -5.034 0.000229 ***
## Age 0.703259 0.049606 14.177 2.76e-09 ***
## Weight 0.969920 0.063108 15.369 1.02e-09 ***
## BSA 3.776491 1.580151 2.390 0.032694 *
## Dur 0.068383 0.048441 1.412 0.181534
## Pulse -0.084485 0.051609 -1.637 0.125594
## Stress 0.005572 0.003412 1.633 0.126491
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4072 on 13 degrees of freedom
## Multiple R-squared: 0.9962, Adjusted R-squared: 0.9944
## F-statistic: 560.6 on 6 and 13 DF, p-value: 6.395e-15
plot(mfull, which=c(1,2))
xyplot(fitted.values(mfull) ~ BP, data=BP)
We have a very high \(R^2\), and as such the correlation between the fitted values and the response variable is very strong. However, there are still some issues with the normality of the residuals. Moreover, the coefficient for Pulse is negative, suggesting that people with higher pulse rates have lower blood pressure. [Does this make sense?] Three of the coefficients in the model are close to zero and not significant at the 10% level.
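Rather than eyeballing the printout, those p-values can be pulled directly out of the coefficient table returned by summary() (a sketch):
# Extract the p-value column from the full model's coefficient table
coef(summary(mfull))[, "Pr(>|t|)"]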
The easiest way to check for multicollinearity is by computing the Variance Inflation Factor (VIF) for each explanatory variable.
# Note that this requires the "car" package!
require(car)
vif(mfull)
## Age Weight BSA Dur Pulse Stress
## 1.762807 8.417035 5.328751 1.237309 4.413575 1.834845
VIFs higher than 4, such as those for Weight, BSA, and Pulse, indicate that those variables are highly correlated with the other explanatory variables. To see this, we can manually compute the VIF for Weight by regressing it against the other explanatory variables.
mvif <- lm(Weight ~ Age + BSA + Dur + Pulse + Stress, data=BP)
summary(mvif)
##
## Call:
## lm(formula = Weight ~ Age + BSA + Dur + Pulse + Stress, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7697 -1.0120 0.1960 0.6955 2.7035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.674438 9.464742 2.079 0.05651 .
## Age -0.144643 0.206491 -0.700 0.49510
## BSA 21.421654 3.464586 6.183 2.38e-05 ***
## Dur 0.008696 0.205134 0.042 0.96678
## Pulse 0.557697 0.159853 3.489 0.00361 **
## Stress -0.022997 0.013079 -1.758 0.10052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.725 on 14 degrees of freedom
## Multiple R-squared: 0.8812, Adjusted R-squared: 0.8388
## F-statistic: 20.77 on 5 and 14 DF, p-value: 5.046e-06
1 / (1 - rsquared(mvif))
## [1] 8.417035
In general, the VIF for the \(i\)th explanatory variable is \(1 / (1 - R_i^2)\), where \(R_i^2\) is the \(R^2\) from regressing that variable on all of the others. If this kind of regression works well, then \(R_i^2\) will be high, and the VIF will become very large. A similar computation can be performed for each of the explanatory variables.
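For instance, a short loop that recomputes every VIF by hand should reproduce the output of vif(mfull) above (a sketch; reformulate() is base R's way of building a formula from variable names):
# Regress each explanatory variable on the others and apply 1 / (1 - R^2)
xvars <- setdiff(names(BP), "BP")
sapply(xvars, function(v) {
  f <- reformulate(setdiff(xvars, v), response = v)
  1 / (1 - rsquared(lm(f, data = BP)))
})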
Another window into the entangled web of our explanatory variables is to examine the correlation between every pair.
cor(BP)
## BP Age Weight BSA Dur Pulse
## BP 1.0000000 0.6590930 0.95006765 0.86587887 0.2928336 0.7214132
## Age 0.6590930 1.0000000 0.40734926 0.37845460 0.3437921 0.6187643
## Weight 0.9500677 0.4073493 1.00000000 0.87530481 0.2006496 0.6593399
## BSA 0.8658789 0.3784546 0.87530481 1.00000000 0.1305400 0.4648188
## Dur 0.2928336 0.3437921 0.20064959 0.13054001 1.0000000 0.4015144
## Pulse 0.7214132 0.6187643 0.65933987 0.46481881 0.4015144 1.0000000
## Stress 0.1639014 0.3682237 0.03435475 0.01844634 0.3116398 0.5063101
## Stress
## BP 0.16390139
## Age 0.36822369
## Weight 0.03435475
## BSA 0.01844634
## Dur 0.31163982
## Pulse 0.50631008
## Stress 1.00000000
Note that BSA and Weight are strongly correlated. [Can you think of why this might be?] Moreover, Pulse is fairly strongly correlated with Age, Weight, and Stress. Since we know that Weight is so strongly correlated with BP to begin with, we’ll want to keep this variable in our model. But let’s try removing BSA and Pulse.
m2 <- lm(BP ~ Weight + Age + Dur + Stress, data=BP)
summary(m2)
##
## Call:
## lm(formula = BP ~ Weight + Age + Dur + Stress, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11359 -0.29586 0.01515 0.27506 0.88674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.869829 3.195296 -4.967 0.000169 ***
## Weight 1.034128 0.032672 31.652 3.76e-15 ***
## Age 0.683741 0.061195 11.173 1.14e-08 ***
## Dur 0.039889 0.064486 0.619 0.545485
## Stress 0.002184 0.003794 0.576 0.573304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5505 on 15 degrees of freedom
## Multiple R-squared: 0.9919, Adjusted R-squared: 0.9897
## F-statistic: 458.3 on 4 and 15 DF, p-value: 1.764e-15
vif(m2)
## Weight Age Dur Stress
## 1.234653 1.468245 1.200060 1.241117
plot(m2, which=c(1,2))
This simpler model explains almost as much of the variation as the full model, but with two fewer explanatory variables. Moreover, the residuals are closer to being normally distributed, and the high VIFs have been eliminated. This model should be preferred to the full model.
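If we want to formalize "almost as much of the variation," a nested F-test asks whether the two dropped variables contribute anything beyond the reduced model (a sketch; a large p-value supports the simpler model):
# Nested F-test comparing the reduced model m2 to the full model
anova(m2, mfull)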
Still, two of the variables in the reduced model (Dur and Stress) do not seem to be contributing much to the model. Let’s see what happens if we eliminate them.
m3 <- lm(BP ~ Weight + Age, data=BP)
summary(m3)
##
## Call:
## lm(formula = BP ~ Weight + Age, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89968 -0.35242 0.06979 0.35528 0.82781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.57937 3.00746 -5.513 3.80e-05 ***
## Weight 1.03296 0.03116 33.154 < 2e-16 ***
## Age 0.70825 0.05351 13.235 2.22e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5327 on 17 degrees of freedom
## Multiple R-squared: 0.9914, Adjusted R-squared: 0.9904
## F-statistic: 978.2 on 2 and 17 DF, p-value: < 2.2e-16
vif(m3)
## Weight Age
## 1.198945 1.198945
plot(m3, which=c(1,2))
This model is again nearly as effective as the full model, with no evidence of multicollinearity, and more normally distributed residuals. What we have lost in \(R^2\), we have more than made up for in simplicity and conformity to our assumptions.
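To round out the comparison, we can jointly test the two variables we just dropped and line up the \(R^2\) values of all three candidate models (a sketch):
# Joint test of Dur and Stress; a large p-value supports dropping both
anova(m3, m2)
# R-squared for each of the three candidate models
sapply(list(mfull = mfull, m2 = m2, m3 = m3), rsquared)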