Lecture 03: Simple linear regression

Professor McNamara

Four-step process

  • Choose
  • Fit
  • Assess
  • Use

Choose

In this course, we will work in two primary paradigms:

  • Quantitative response variable
  • Categorical response variable

The model we choose will depend on what type of response variable we have.

Model notation

\[ Y = f(X) + \epsilon \]

  • \(Y\): the response variable
  • \(f\): the function describing the relationship between \(X\) and \(Y\)
  • \(X\): the explanatory variable
  • \(\epsilon\): the error term

Car data example

name                      sports_car  all_wheel   msrp  horsepwr  weight  wheel_base
Jaguar S-Type 4.2 4dr     FALSE       FALSE      49995       294    3874         115
Mercedes-Benz E500 4dr    FALSE       FALSE      57270       302    3815         112
Kia Spectra GS 4dr hatch  FALSE       FALSE      13580       124    2686         101
Hyundai Sonata GLS 4dr    FALSE       FALSE      19339       170    3217         106

  • What type of model would we use to predict the Manufacturer’s Suggested Retail Price (MSRP)?
  • To predict whether or not the car is a sports car?
  • To predict the horsepower?

Simple linear regression

For simple linear regression, the model

\[ Y = f(X) + \epsilon \]

has the following qualities:

  • \(Y\): a quantitative response variable
  • \(f\): a linear function, \(\beta_0 + \beta_1 \cdot X\)
  • \(X\): a quantitative explanatory variable
  • \(\epsilon\): the error term, assumed to follow a normal distribution

Parameters and statistics

  • parameters: numerical summaries of the population (here, \(\beta_0\) and \(\beta_1\))

  • statistics: numerical summaries computed from a sample, used to estimate the parameters (here, \(\hat{\beta}_0\) and \(\hat{\beta}_1\))

More notation

  • \(Y, X\): the random variables
  • \(y, x\): observed values of the variables
  • \(y_i, x_i\): the observed values for the \(i\)th case in the sample
  • \(\bar{y}, \bar{x}\): the sample means (see the quick check below), \[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \]
  • \(\hat{y}\): the predicted (fitted) value of the response
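
As a quick check of the \(\bar{y}\) formula (a sketch, assuming the cars04 data we use later is loaded), we can compute a sample mean both ways in R:

y <- na.omit(cars04$msrp)  # drop any missing values first
mean(y)                    # built-in sample mean
sum(y) / length(y)         # the formula above, computed by hand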

SLR model notation

Model

\[ Y = \beta_0 + \beta_1\cdot X + \epsilon, \, \epsilon \sim N(0, \sigma_\epsilon) \]

Estimate

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1\cdot x \]

Residuals

observed - predicted:

\[ y_i - \hat{y}_i \]

Car price: exploratory data analysis

Let’s look at the relationship between the car’s weight (in pounds) and the MSRP (in dollars).

library(ggplot2)  # load ggplot2 for plotting

ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))

How could we describe this relationship?

Car price: fit

Given how the data looked in the scatterplot above, it seems reasonable to choose a simple linear regression model. Here’s how we fit one in R:

lm(msrp ~ weight, data = cars04)

Call:
lm(formula = msrp ~ weight, data = cars04)

Coefficients:
(Intercept)       weight  
   -8383.10        11.51  

Least squares regression

To fit the model, R is minimizing the sum of the squared residuals,

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad \hat{\sigma}_\epsilon = \sqrt{\frac{SSE}{n-2}} \]

Sometimes the SSE is also called the Residual Sum of Squares (RSS).
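
As a sketch of what R is doing (assuming we store the fitted model as m1, as we do later in these notes), we can compute the SSE and \(\hat{\sigma}_\epsilon\) ourselves and compare the latter to the “residual standard error” that summary() reports:

m1 <- lm(msrp ~ weight, data = cars04)
SSE <- sum(residuals(m1)^2)  # sum of squared residuals
n <- length(residuals(m1))   # number of observations used in the fit
sqrt(SSE / (n - 2))          # should match the residual standard error from summary()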

Car price: use

One thing I want you to be able to do is write out the equation of the line.

lm(msrp ~ weight, data = cars04)

Call:
lm(formula = msrp ~ weight, data = cars04)

Coefficients:
(Intercept)       weight  
   -8383.10        11.51  

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

Car price: use

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

  • What would this model predict for a Volvo XC70, which weighs 3823 pounds?
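
One way to check (a sketch, assuming the model has been stored as m1, as we do later): plug in by hand, or hand a new data frame to R’s predict() function. Note that predict() uses the unrounded coefficients, so it will differ slightly from the by-hand answer.

# by hand, with the rounded coefficients
-8383.10 + 11.51 * 3823

# with the stored model object
m1 <- lm(msrp ~ weight, data = cars04)
predict(m1, newdata = data.frame(weight = 3823))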

Car price: assess

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

  • The XC70 actually cost $35,145 new, what would the residual for this car be? Did we overpredict or underpredict?
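
A sketch of this check in R (residual = observed - predicted, using the rounded coefficients):

observed <- 35145
predicted <- -8383.10 + 11.51 * 3823
observed - predicted  # a negative residual means we overpredicted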

Interpretation sentences

I like to use ‘recipe’ sentences for interpretations. Here are two:

Slope: “For a 1-[unit] increase in [explanatory variable], we would expect to see a [\(\hat{\beta}_1\)]-[unit] [increase/decrease] in [response variable].”

Intercept: “If the value of [explanatory variable] was 0, we would expect [response variable] to be [\(\hat{\beta}_0\)] [units].”

Car price: use

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

How would we interpret the coefficients from this model? Does it make sense to interpret the intercept?

Fitting the model in R, again

The first time I fit the model, all I wanted was the coefficients:

lm(msrp ~ weight, data = cars04)

But in this class, we are going to need to go a lot further, so we need to store our model object in R and give it a name. I use kind of bad names (m1, m2, etc.); you could choose a better name.

m1 <- lm(msrp ~ weight, data = cars04)

We’re using the assignment operator <- to store the results of our function in a named object. You can also use =; as with many things, I’ll try to be consistent, but I sometimes switch between the two.

R: More assessing and using

If we want to do more assessing and using, we typically want to look at the model output (the fitted coefficients, etc.). If we run summary() on our model object, we can look at the output.

summary(m1)

Call:
lm(formula = msrp ~ weight, data = cars04)

Residuals:
   Min     1Q Median     3Q    Max 
-32870  -8283  -5023   3324 164823 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8383.095   4062.760  -2.063   0.0397 *  
weight         11.506      1.111  10.357   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17420 on 424 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2019,    Adjusted R-squared:    0.2 
F-statistic: 107.3 on 1 and 424 DF,  p-value: < 2.2e-16

Notice that the coefficients are the same; they just show up in a column rather than in a row. Let’s write out the model equation again.
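
If we want the coefficients to more decimal places than the printed table (a quick sketch using the m1 object from above), coef() extracts them from the model object:

coef(m1)             # named vector: (Intercept) and weight
coef(m1)["weight"]   # just the slope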

Bonus: adding a line to a scatterplot

Sometimes it’s nice to be able to see the fitted line on the scatterplot. Note that putting the line on the scatterplot is not what I mean when I say to “fit the model” (I want you using lm() for that), but it’s a nice way to visualize what’s going on.

First, I’m going to show you another way to make the scatterplot we did before:

# the version we used before: aesthetics set inside geom_point()
ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))

# an equivalent version: aesthetics set in ggplot(), so later layers inherit them
ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point()

Bonus: adding a line to a scatterplot

Now, we can add on another geometric object, a line, with the same aesthetics.

ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # method = "lm" draws the least squares line; se = FALSE hides the confidence band
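
An alternative sketch (not in the original slides): since we already stored the fit as m1, geom_abline() can draw the same line directly from the fitted coefficients, which makes the link to lm() explicit:

ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point() +
  geom_abline(intercept = coef(m1)[1], slope = coef(m1)[2])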