Lecture 03: Simple linear regression

Professor McNamara

Four-step process

  • Choose
  • Fit
  • Assess
  • Use

Choose

In this course, we will work in two primary paradigms:

  • Quantitative response variable
  • Categorical response variable

The model we choose will depend on what type of response variable we have.

Model notation

\[ Y = f(X) + \epsilon \]

  • \(Y\):
  • \(f\):
  • \(X\):
  • \(\epsilon\):

Car data example

  • What type of model would we use to predict the Manufacturer’s Suggested Retail Price (MSRP)?
  • To predict whether or not the car is a sports car?
  • To predict the horsepower?

Simple linear regression

For simple linear regression, the model

\[ Y = f(X) + \epsilon \] has the following qualities:

  • \(Y\):
  • \(f\):
  • \(X\):
  • \(\epsilon\):

Parameters and statistics

  • parameters

  • statistics

More notation

  • \(Y,X\):
  • \(y,x\):
  • \(y_i,x_i\):
  • \(\bar{y}, \bar{x}\): \[ \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i \]
  • \(\hat{y}\):

SLR model notation

Model

\[ Y = \beta_0 + \beta_1\cdot X + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon) \]

Estimate

\[ \hat{y} = \hat{\beta}_0 + \hat{\beta}_1\cdot x \]

Residuals

observed - expected

\[ y_i - \hat{y}_i \]
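To make the definition concrete, here is a minimal sketch in R that computes residuals by hand and compares them with what R reports. The data frame `toy` and its values are made up for illustration; they are not the car data used in this lecture.

```r
# Hypothetical toy data (NOT the cars04 data from the lecture)
toy <- data.frame(
  weight = c(2500, 3000, 3500, 4000, 4500),     # pounds
  msrp   = c(18000, 27000, 30000, 41000, 44000) # dollars
)

# Fit a simple linear regression of msrp on weight
fit <- lm(msrp ~ weight, data = toy)

# Residual = observed - expected (fitted) value
y_hat       <- fitted(fit)
res_by_hand <- toy$msrp - y_hat

# Compare with the residuals R stores in the model object
all.equal(unname(res_by_hand), unname(residuals(fit)))
```

A positive residual means the observed value sits above the fitted line, and a negative residual means it sits below.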

Car price: exploratory data analysis

Let’s look at the relationship between the car’s weight (in pounds) and the MSRP (in dollars).

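library(ggplot2)  # provides ggplot(), geom_point(), and aes()
# Scatterplot of MSRP (dollars) against weight (pounds)
ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))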

How could we describe this relationship?

Car price: fit

Given how the data looked in the scatterplot we saw above, it seems reasonable to choose a simple linear regression model. We can then fit it using R. Here’s how we fit a simple linear regression model:

lm(msrp~weight, data = cars04)

Least squares regression

To fit the model, R chooses the coefficients that minimize the sum of the squared residuals (SSE). The SSE also gives us an estimate of the standard deviation of the errors:

\[ SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad \hat{\sigma}_\epsilon = \sqrt{\frac{SSE}{n-2}} \]

Sometimes the SSE is also called the Residual Sum of Squares (RSS).
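The two formulas above can be sketched in R as follows. This is an illustration on made-up data (the data frame `toy` and the helper `sse_for_line()` are hypothetical names, not part of the lecture's code). It computes the SSE and the estimated error standard deviation by hand, then checks that nudging the slope away from the least squares value makes the SSE larger.

```r
# Hypothetical toy data (NOT the cars04 data from the lecture)
toy <- data.frame(
  weight = c(2500, 3000, 3500, 4000, 4500),
  msrp   = c(18000, 27000, 30000, 41000, 44000)
)

fit <- lm(msrp ~ weight, data = toy)

# Sum of squared residuals
sse <- sum(residuals(fit)^2)

# Estimated standard deviation of the errors: sqrt(SSE / (n - 2))
n         <- nrow(toy)
sigma_hat <- sqrt(sse / (n - 2))

# summary() reports the same quantity as the "residual standard error"
all.equal(sigma_hat, summary(fit)$sigma)

# Hypothetical helper: SSE for an arbitrary intercept b0 and slope b1
sse_for_line <- function(b0, b1) {
  sum((toy$msrp - (b0 + b1 * toy$weight))^2)
}

# Any line other than the least squares line has a larger SSE
b <- coef(fit)
sse_for_line(b[1], b[2] + 1) > sse
```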

Car price: use

One thing I want you to be able to do is write out the equation of the line.

lm(msrp~weight, data = cars04)

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

Car price: use

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

  • What would this model predict for a Volvo XC70, which weighs 3823 pounds?
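One way to check the arithmetic is to plug the weight into the fitted equation in R. This is a rough sketch using the rounded coefficients shown above; the variable names (`b0`, `b1`, `weight_xc70`) are made up for illustration.

```r
# Rounded coefficients from the fitted equation above
b0 <- -8383.10   # intercept (dollars)
b1 <- 11.51      # slope (dollars per pound)

# Weight of the Volvo XC70 in pounds
weight_xc70 <- 3823

# Predicted MSRP: intercept + slope * weight
predicted_msrp <- b0 + b1 * weight_xc70
predicted_msrp   # about 35619.63 dollars
```

Because the coefficients are rounded, a prediction made from the full model object may differ slightly from this value.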

Car price: assess

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

  • The XC70 actually cost $35,145 new. What would the residual for this car be? Did we overpredict or underpredict?
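The residual calculation can be sketched the same way, again with the rounded coefficients from the equation above. The variable names are made up for illustration.

```r
# Rounded coefficients from the fitted equation above
b0 <- -8383.10
b1 <- 11.51

# Predicted MSRP for a car weighing 3823 pounds
predicted <- b0 + b1 * 3823

# Observed MSRP of the XC70
observed <- 35145

# Residual = observed - expected
residual_xc70 <- observed - predicted
residual_xc70       # about -474.63 dollars

# A negative residual means the prediction was above the observed price
residual_xc70 < 0   # TRUE
```

Since the residual is negative, the model overpredicted the price of this car.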

Interpretation sentences

I like to use ‘recipe’ sentences for interpretations. Here are two:

Slope: “For a 1-[unit] increase in [explanatory variable], we would expect to see a [\(\hat{\beta}_1\)]-[unit] [increase/decrease] in [response variable].”

Intercept: “If the value of [explanatory variable] were 0, we would expect [response variable] to be [\(\hat{\beta}_0\)] [units].”

Car price: use

\[ \widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight} \]

How would we interpret the coefficients from this model? Does it make sense to interpret the intercept?

Fitting the model in R, again

The first time I fit the model, all I wanted were the coefficients:

lm(msrp~weight, data = cars04)

But in this class, we are going to need to go a lot further. So, we need to store our model object in R and give it a name. I use kind of bad names (m1, m2, etc.), so you could choose a better one.

m1 <- lm(msrp~weight, data = cars04)

We’re using the assignment operator to store the results of our function into a named object.

I’m using the assignment operator <-, but you can also use =. As with many things, I’ll try to be consistent, but I sometimes switch between the two.
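As a small illustration of the two assignment operators, the sketch below stores the same model under two names, once with `<-` and once with `=`, and checks that the results match. The data frame `toy` and the object names are hypothetical, not part of the lecture's car data.

```r
# Hypothetical toy data (NOT the cars04 data from the lecture)
toy <- data.frame(
  weight = c(2500, 3000, 3500, 4000, 4500),
  msrp   = c(18000, 27000, 30000, 41000, 44000)
)

# Store the fitted model with <- ...
m_arrow <- lm(msrp ~ weight, data = toy)

# ... and with = ; both create a named model object
m_equals = lm(msrp ~ weight, data = toy)

# The stored object can be reused, e.g. to pull out the coefficients
coef(m_arrow)

# Both operators produced the same fitted coefficients
identical(coef(m_arrow), coef(m_equals))
```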

R: More assessing and using

Now, we want to move on to assessing and using our model. Typically, this means we want to look at the model output (the fitted coefficients, etc.). If we run summary() on our model object, we can look at the output.

summary(m1)

Notice that the coefficients are the same. They just show up in a column rather than printed out in a row. Let’s write out the model equation again.
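If you want to confirm that the two displays agree, one option is to pull the estimates out of the summary and compare them with the coefficients from the model object. This is a sketch on made-up data (the data frame `toy` and the object names are hypothetical), not output from the car model.

```r
# Hypothetical toy data (NOT the cars04 data from the lecture)
toy <- data.frame(
  weight = c(2500, 3000, 3500, 4000, 4500),
  msrp   = c(18000, 27000, 30000, 41000, 44000)
)

fit <- lm(msrp ~ weight, data = toy)

# Coefficients printed as a row by the model object
coef(fit)

# Coefficient table from summary(): one row per coefficient, with
# columns Estimate, Std. Error, t value, and Pr(>|t|)
coef_table <- coef(summary(fit))

# The Estimate column holds the same numbers
all.equal(coef_table[, "Estimate"], coef(fit))
```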