require(mosaic)
require(car)
Researchers observed the following data on 20 individuals with high blood pressure:

- BP (blood pressure, in mm Hg)
- Age (in years)
- Weight (in kg)
- BSA (body surface area, in m^2)
- Dur (duration of hypertension, in years)
- Pulse (in beats per minute)
- Stress (stress index)

Our goal is to build a model for blood pressure as a function of (some subset of) the other variables. In this case all of our variables are quantitative, so we can get a quick look at their relationships using a pairs plot.
BP <- read.csv("http://www.math.smith.edu/~bbaumer/mth247/labs/bloodpress.csv")
pairs(BP)
# Better than the standard pairs plot is the generalized pairs plot.
# install.packages("gpairs")
# require(gpairs)
# gpairs(BP)
Which of the explanatory variables appears to be most strongly correlated with BP? Hint: use cor to calculate the correlation coefficient matrix.
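One way to act on that hint is to pull out just the column of correlations with BP and sort it by strength of association (a sketch; the abs() is there because a strong negative correlation would matter just as much):
# Correlation of each variable with BP, sorted by absolute value
# (BP itself will appear at the top with correlation 1)
sort(abs(cor(BP)[, "BP"]), decreasing = TRUE)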
Weight seems to be highly correlated with BP, so as a first step, we should understand how well a simple linear model for blood pressure as a function of weight works. Keep in mind that it accords with our intuition that there would be a strong link between a person’s weight and their blood pressure.
xyplot(BP ~ Weight, data=BP, pch=19, alpha=0.5, cex=1.5, type=c("p", "r"))
m1 <- lm(BP ~ Weight, data=BP)
summary(m1)
##
## Call:
## lm(formula = BP ~ Weight, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6933 -0.9318 -0.4935 0.7703 4.8656
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.20531 8.66333 0.255 0.802
## Weight 1.20093 0.09297 12.917 1.53e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 18 degrees of freedom
## Multiple R-squared: 0.9026, Adjusted R-squared: 0.8972
## F-statistic: 166.9 on 1 and 18 DF, p-value: 1.528e-10
plot(m1, which=c(1,2))
The SLR model for blood pressure is pretty effective, but there is one person who generates a large residual. We want to use the additional data that we have to build a better model for blood pressure.
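As an aside, one quick way to pinpoint which observation generates that large residual (a sketch using base R's resid()):
# Index of the observation with the largest absolute residual from the SLR fit
which.max(abs(resid(m1)))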
Without any intuition, one way to proceed is to simply throw all of the variables into our regression model. In a sense, we are throwing everything but the kitchen sink into the model.
# Using the . in the formula interface includes all non-response variables in the data frame
mfull <- lm(BP ~ ., data=BP)
summary(mfull)
##
## Call:
## lm(formula = BP ~ ., data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93213 -0.11314 0.03064 0.21834 0.48454
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -12.870476 2.556650 -5.034 0.000229 ***
## Age 0.703259 0.049606 14.177 2.76e-09 ***
## Weight 0.969920 0.063108 15.369 1.02e-09 ***
## BSA 3.776491 1.580151 2.390 0.032694 *
## Dur 0.068383 0.048441 1.412 0.181534
## Pulse -0.084485 0.051609 -1.637 0.125594
## Stress 0.005572 0.003412 1.633 0.126491
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4072 on 13 degrees of freedom
## Multiple R-squared: 0.9962, Adjusted R-squared: 0.9944
## F-statistic: 560.6 on 6 and 13 DF, p-value: 6.395e-15
plot(mfull, which=c(1,2))
xyplot(fitted.values(mfull) ~ BP, data=BP)
We have a very high \(R^2\), and as such the correlation between the fitted values and the response variable is very strong. However, there are still some issues with the normality of the residuals. Moreover, the coefficient for Pulse is negative, suggesting that people with higher pulse rates have lower blood pressure. [Does this make sense?] Three of the coefficients in the model are close to zero and not significant at the 10% level.
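Rather than eyeballing the printout, those p-values can be pulled directly out of the coefficient table returned by summary() (a sketch):
# Extract the p-value column from the full model's coefficient table
coef(summary(mfull))[, "Pr(>|t|)"]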
The easiest way to check for multicollinearity is by computing the Variance Inflation Factor (VIF) for each explanatory variable.
# Note that this requires the "car" package!
require(car)
vif(mfull)
## Age Weight BSA Dur Pulse Stress
## 1.762807 8.417035 5.328751 1.237309 4.413575 1.834845
VIFs higher than 4, such as those for Weight, BSA, and Pulse, indicate that those variables are highly correlated with the other explanatory variables. To see this, we can manually compute the VIF for Weight by regressing it against the other explanatory variables.
mvif <- lm(Weight ~ Age + BSA + Dur + Pulse + Stress, data=BP)
summary(mvif)
##
## Call:
## lm(formula = Weight ~ Age + BSA + Dur + Pulse + Stress, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7697 -1.0120 0.1960 0.6955 2.7035
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.674438 9.464742 2.079 0.05651 .
## Age -0.144643 0.206491 -0.700 0.49510
## BSA 21.421654 3.464586 6.183 2.38e-05 ***
## Dur 0.008696 0.205134 0.042 0.96678
## Pulse 0.557697 0.159853 3.489 0.00361 **
## Stress -0.022997 0.013079 -1.758 0.10052
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.725 on 14 degrees of freedom
## Multiple R-squared: 0.8812, Adjusted R-squared: 0.8388
## F-statistic: 20.77 on 5 and 14 DF, p-value: 5.046e-06
1 / (1 - rsquared(mvif))
## [1] 8.417035
In general, the VIF for the \(i\)th explanatory variable is \(1 / (1 - R_i^2)\), where \(R_i^2\) is the \(R^2\) from regressing that variable on all of the others. If this kind of regression works well, then \(R_i^2\) will be high, and the VIF will become very large. A similar computation can be performed for each of the explanatory variables.
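For instance, a short loop that recomputes every VIF by hand should reproduce the output of vif(mfull) above (a sketch; reformulate() is base R's way of building a formula from variable names):
# Regress each explanatory variable on the others and apply 1 / (1 - R^2)
xvars <- setdiff(names(BP), "BP")
sapply(xvars, function(v) {
  f <- reformulate(setdiff(xvars, v), response = v)
  1 / (1 - rsquared(lm(f, data = BP)))
})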
Another window into the entangled web of our explanatory variables is to examine the correlation between every pair.
cor(BP)
## BP Age Weight BSA Dur Pulse
## BP 1.0000000 0.6590930 0.95006765 0.86587887 0.2928336 0.7214132
## Age 0.6590930 1.0000000 0.40734926 0.37845460 0.3437921 0.6187643
## Weight 0.9500677 0.4073493 1.00000000 0.87530481 0.2006496 0.6593399
## BSA 0.8658789 0.3784546 0.87530481 1.00000000 0.1305400 0.4648188
## Dur 0.2928336 0.3437921 0.20064959 0.13054001 1.0000000 0.4015144
## Pulse 0.7214132 0.6187643 0.65933987 0.46481881 0.4015144 1.0000000
## Stress 0.1639014 0.3682237 0.03435475 0.01844634 0.3116398 0.5063101
## Stress
## BP 0.16390139
## Age 0.36822369
## Weight 0.03435475
## BSA 0.01844634
## Dur 0.31163982
## Pulse 0.50631008
## Stress 1.00000000
Note that BSA and Weight are strongly correlated. [Can you think of why this might be?] Moreover, Pulse is fairly strongly correlated with Age, Weight, and Stress. Since we know that Weight is so strongly correlated with BP to begin with, we’ll want to keep this variable in our model. But let’s try removing BSA and Pulse.
m2 <- lm(BP ~ Weight + Age + Dur + Stress, data=BP)
summary(m2)
##
## Call:
## lm(formula = BP ~ Weight + Age + Dur + Stress, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11359 -0.29586 0.01515 0.27506 0.88674
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.869829 3.195296 -4.967 0.000169 ***
## Weight 1.034128 0.032672 31.652 3.76e-15 ***
## Age 0.683741 0.061195 11.173 1.14e-08 ***
## Dur 0.039889 0.064486 0.619 0.545485
## Stress 0.002184 0.003794 0.576 0.573304
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5505 on 15 degrees of freedom
## Multiple R-squared: 0.9919, Adjusted R-squared: 0.9897
## F-statistic: 458.3 on 4 and 15 DF, p-value: 1.764e-15
vif(m2)
## Weight Age Dur Stress
## 1.234653 1.468245 1.200060 1.241117
plot(m2, which=c(1,2))
This simpler model explains almost as much of the variation as the full model, but with two fewer explanatory variables. Moreover, the residuals are closer to being normally distributed, and the high VIFs have been eliminated. This model should be preferred to the full model.
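If we want to formalize "almost as much of the variation," a nested F-test asks whether the two dropped variables contribute anything beyond the reduced model (a sketch; a large p-value supports the simpler model):
# Nested F-test comparing the reduced model m2 to the full model
anova(m2, mfull)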
Still, two of the variables in the reduced model (Dur and Stress) do not seem to be contributing much to the model. Let’s see what happens if we eliminate them.
m3 <- lm(BP ~ Weight + Age, data=BP)
summary(m3)
##
## Call:
## lm(formula = BP ~ Weight + Age, data = BP)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89968 -0.35242 0.06979 0.35528 0.82781
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -16.57937 3.00746 -5.513 3.80e-05 ***
## Weight 1.03296 0.03116 33.154 < 2e-16 ***
## Age 0.70825 0.05351 13.235 2.22e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5327 on 17 degrees of freedom
## Multiple R-squared: 0.9914, Adjusted R-squared: 0.9904
## F-statistic: 978.2 on 2 and 17 DF, p-value: < 2.2e-16
vif(m3)
## Weight Age
## 1.198945 1.198945
plot(m3, which=c(1,2))
This model is again nearly as effective as the full model, with no evidence of multicollinearity, and more normally distributed residuals. What we have lost in \(R^2\), we have more than made up for in simplicity and conformity to our assumptions.
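To round out the comparison, we can jointly test the two variables we just dropped and line up the \(R^2\) values of all three candidate models (a sketch):
# Joint test of Dur and Stress; a large p-value supports dropping both
anova(m3, m2)
# R-squared for each of the three candidate models
sapply(list(mfull = mfull, m2 = m2, m3 = m3), rsquared)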