Lecture 03: Simple linear regression

Professor McNamara

Four-step process

  • Choose
  • Fit
  • Assess
  • Use

Choose

In this course, we will work in two primary paradigms:

  • Quantitative response variable
  • Categorical response variable

The model we choose will depend on what type of response variable we have.

Model notation

\[ Y = f(X) + \epsilon \]

  • \(Y\): the response variable
  • \(f\): the function describing the relationship between \(X\) and \(Y\)
  • \(X\): the explanatory variable
  • \(\epsilon\): the error term

Car data example

name                      sports_car  all_wheel   msrp  horsepwr  weight  wheel_base
Jaguar S-Type 4.2 4dr     FALSE       FALSE      49995       294    3874         115
Mercedes-Benz E500 4dr    FALSE       FALSE      57270       302    3815         112
Kia Spectra GS 4dr hatch  FALSE       FALSE      13580       124    2686         101
Hyundai Sonata GLS 4dr    FALSE       FALSE      19339       170    3217         106

  • What type of model would we use to predict the Manufacturer’s Suggested Retail Price (MSRP)?
  • To predict whether or not the car is a sports car?
  • To predict the horsepower?

Simple linear regression

For simple linear regression, the model

\[ Y = f(X) + \epsilon \]

has the following qualities:

  • \(Y\): a quantitative response variable
  • \(f\): a linear function, \(\beta_0 + \beta_1 \cdot X\)
  • \(X\): a quantitative explanatory variable
  • \(\epsilon\): the error term, assumed to follow a normal distribution

Parameters and statistics

  • parameters: numerical summaries of the population (here, \(\beta_0\) and \(\beta_1\))

  • statistics: numerical summaries computed from a sample, used to estimate the parameters (here, \(\hat{\beta}_0\) and \(\hat{\beta}_1\))

More notation

  • \(Y, X\): the random variables
  • \(y, x\): observed values of the variables
  • \(y_i, x_i\): the observed values for the \(i\)th case in the sample
  • \(\bar{y}, \bar{x}\): the sample means (see the quick check below), \[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \]
  • \(\hat{y}\): the predicted (fitted) value of the response
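
As a quick check of the \(\bar{y}\) formula (a sketch, assuming the cars04 data we use later is loaded), we can compute a sample mean both ways in R:

y <- na.omit(cars04$msrp)  # drop any missing values first
mean(y)                    # built-in sample mean
sum(y) / length(y)         # the formula above, computed by hand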

SLR model notation

Model

\[ Y = \beta_0 + \beta_1\cdot X + \epsilon, \, \epsilon \sim N(0, \sigma_\epsilon) \]

Estimate

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1\cdot x \]

Residuals

observed - predicted:

\[ y_i - \hat{y}_i \]

Car price: exploratory data analysis

Let’s look at the relationship between the car’s weight (in pounds) and the MSRP (in dollars).

library(ggplot2)  # load ggplot2 for plotting

ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))

How could we describe this relationship?

Car price: fit

Given how the data looked in the scatterplot above, it seems reasonable to choose a simple linear regression model. Here’s how we fit one in R:

lm(msrp ~ weight, data = cars04)

Call:
lm(formula = msrp ~ weight, data = cars04)

Coefficients:
(Intercept)       weight  
   -8383.10        11.51  

Least squares regression

To fit the model, R is minimizing the sum of the squared residuals,

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad \hat{\sigma}_\epsilon = \sqrt{\frac{SSE}{n-2}} \]

Sometimes the SSE is also called the Residual Sum of Squares (RSS).
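
As a sketch of what R is doing (assuming we store the fitted model as m1, as we do later in these notes), we can compute the SSE and \(\hat{\sigma}_\epsilon\) ourselves and compare the latter to the “residual standard error” that summary() reports:

m1 <- lm(msrp ~ weight, data = cars04)
SSE <- sum(residuals(m1)^2)  # sum of squared residuals
n <- length(residuals(m1))   # number of observations used in the fit
sqrt(SSE / (n - 2))          # should match the residual standard error from summary()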

Car price: use

One thing I want you to be able to do is write out the equation of the line.

lm(msrp ~ weight, data = cars04)

Call:
lm(formula = msrp ~ weight, data = cars04)

Coefficients:
(Intercept)       weight  
   -8383.10        11.51  

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

Car price: use

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

  • What would this model predict for a Volvo XC70, which weighs 3823 pounds?
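
One way to check (a sketch, assuming the model has been stored as m1, as we do later): plug in by hand, or hand a new data frame to R’s predict() function. Note that predict() uses the unrounded coefficients, so it will differ slightly from the by-hand answer.

# by hand, with the rounded coefficients
-8383.10 + 11.51 * 3823

# with the stored model object
m1 <- lm(msrp ~ weight, data = cars04)
predict(m1, newdata = data.frame(weight = 3823))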

Car price: assess

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

  • The XC70 actually cost $35,145 new, what would the residual for this car be? Did we overpredict or underpredict?
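
A sketch of this check in R (residual = observed - predicted, using the rounded coefficients):

observed <- 35145
predicted <- -8383.10 + 11.51 * 3823
observed - predicted  # a negative residual means we overpredicted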

Interpretation sentences

I like to use ‘recipe’ sentences for interpretations. Here are two:

Slope: “For a 1-[unit] increase in [explanatory variable], we would expect to see a [\(\hat{\beta}_1\)]-[unit] [increase/decrease] in [response variable].”

Intercept: “If the value of [explanatory variable] was 0, we would expect [response variable] to be [\(\hat{\beta}_0\)] [units].”

Car price: use

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

How would we interpret the coefficients from this model? Does it make sense to interpret the intercept?

Fitting the model in R, again

The first time I fit the model, all I wanted was the coefficients:

lm(msrp ~ weight, data = cars04)

But in this class, we are going to need to go a lot further, so we need to store our model object in R and give it a name. I use kind of bad names (m1, m2, etc.); you could choose a better name.

m1 <- lm(msrp ~ weight, data = cars04)

We’re using the assignment operator <- to store the results of our function in a named object. You can also use =; as with many things, I’ll try to be consistent, but I sometimes switch between the two.

R: More assessing and using

If we want to do more assessing and using, we typically want to look at the model output (the fitted coefficients, etc.). If we run summary() on our model object, we can look at the output.

summary(m1)

Call:
lm(formula = msrp ~ weight, data = cars04)

Residuals:
   Min     1Q Median     3Q    Max 
-32870  -8283  -5023   3324 164823 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8383.095   4062.760  -2.063   0.0397 *  
weight         11.506      1.111  10.357   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17420 on 424 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2019,    Adjusted R-squared:    0.2 
F-statistic: 107.3 on 1 and 424 DF,  p-value: < 2.2e-16

Notice that the coefficients are the same; they just show up in a column rather than in a row. Let’s write out the model equation again.
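
If we want the coefficients to more decimal places than the printed table (a quick sketch using the m1 object from above), coef() extracts them from the model object:

coef(m1)             # named vector: (Intercept) and weight
coef(m1)["weight"]   # just the slope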

Bonus: adding a line to a scatterplot

Sometimes it’s nice to be able to see the fitted line on the scatterplot. Note that putting the line on the scatterplot is not what I mean when I say to “fit the model” (I want you using lm() for that), but it’s a nice way to visualize what’s going on.

First, I’m going to show you another way to make the scatterplot we did before:

# the version we used before: aesthetics set inside geom_point()
ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))

# an equivalent version: aesthetics set in ggplot(), so later layers inherit them
ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point()

Bonus: adding a line to a scatterplot

Now, we can add on another geometric object, a line, with the same aesthetics.

ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)  # method = "lm" draws the least squares line; se = FALSE hides the confidence band
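
An alternative sketch (not in the original slides): since we already stored the fit as m1, geom_abline() can draw the same line directly from the fitted coefficients, which makes the link to lm() explicit:

ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point() +
  geom_abline(intercept = coef(m1)[1], slope = coef(m1)[2])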