Model
\[
Y = \beta_0 + \beta_1\cdot X + \epsilon, \quad \epsilon \sim N(0, \sigma_\epsilon)
\]
Estimate
\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1\cdot x
\]
Residuals
observed - predicted:
\[
y_i - \hat{y}_i
\]
Car price: exploratory data analysis
Let’s look at the relationship between the car’s weight (in pounds) and the MSRP (in dollars).
ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))
How could we describe this relationship?
Car price: fit
Given how the data looked in the scatterplot above, it seems reasonable to choose a simple linear regression model, which we can then fit in R. Here's how:
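The basic call is lm() with a formula of the form response ~ explanatory (the same call we come back to below):
lm(msrp ~ weight, data = cars04)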
To fit the model, R minimizes the sum of the squared residuals,
\[
SSE = \sum_{i=1}^n (y_i - \hat{y}_i)^2, \qquad \hat{\sigma}_\epsilon = \sqrt{\frac{SSE}{n-2}}
\] Sometimes the SSE is also called the Residual Sum of Squares (RSS).
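Once we've stored the fitted model (we name it m1 below), here is a quick sketch of how these quantities could be computed by hand; df.residual() returns the n - 2 in the formula, and sigma() extracts the same residual standard error that summary() reports:
sse <- sum(residuals(m1)^2)     # SSE / RSS
sqrt(sse / df.residual(m1))     # hat sigma_epsilon = sqrt(SSE / (n - 2))
sigma(m1)                       # same value, extracted directly from the model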
Car price: use
One thing I want you to be able to do is write out the equation of the line.
The XC70 actually cost $35,145 new. What would the residual for this car be? Did we overpredict or underpredict?
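As a sketch of how we could check this in R once the model is stored as m1 (defined below); the weight value here is just a placeholder, not the XC70's actual weight:
xc70 <- data.frame(weight = 4000)      # placeholder weight; look up the real value
y_hat <- predict(m1, newdata = xc70)   # predicted MSRP at that weight
35145 - y_hat                          # residual = observed - predicted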
Interpretation sentences
I like to use ‘recipe’ sentences for interpretations. Here are two:
Slope: “For a 1-[unit] increase in [explanatory variable], we would expect to see a [\(\hat{\beta}_1\)]-[unit] [increase/decrease] in [response variable].”
Intercept: “If the value of [explanatory variable] was 0, we would expect [response variable] to be [\(\hat{\beta}_0\)] [units].”
Car price: use
\[
\widehat{\text{MSRP}} = -8383.10 + 11.51\cdot \text{weight}
\] How would we interpret the coefficients from this model? Does it make sense to interpret the intercept?
Fitting the model in R, again
The first time I fit the model, all I wanted were the coefficients:
lm(msrp ~ weight, data = cars04)
But in this class, we are going to need to go a lot further, so we need to store our model object in R and give it a name. I use kind of bad names (m1, m2, etc.); you could choose a better name.
m1 <- lm(msrp ~ weight, data = cars04)
We’re using the assignment operator to store the results of our function into a named object.
I’m using the assignment operator <-, but you can also use =. As with many things, I’ll try to be consistent, but I sometimes switch between the two.
R: More assessing and using
If we want to do more assessing and using, we typically need to look at the model output (the fitted coefficients, etc.). Running summary() on our model object shows that output.
summary(m1)
Call:
lm(formula = msrp ~ weight, data = cars04)

Residuals:
   Min     1Q Median     3Q    Max 
-32870  -8283  -5023   3324 164823 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8383.095   4062.760  -2.063   0.0397 *  
weight         11.506      1.111  10.357   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17420 on 424 degrees of freedom
  (2 observations deleted due to missingness)
Multiple R-squared:  0.2019,    Adjusted R-squared:  0.2
F-statistic: 107.3 on 1 and 424 DF,  p-value: < 2.2e-16
Notice that the coefficients are the same; they just show up in a column rather than printed out as a row. Let's write out the model equation again.
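If all we want are the estimates themselves (say, to plug into the equation), one way to pull them out of the stored model object is coef():
coef(m1)             # named vector with the intercept and slope estimates
coef(m1)["weight"]   # just the slope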
Bonus: adding a line to a scatterplot
Sometimes it's nice to be able to see the fitted line on the scatterplot. Note that putting the line on the scatterplot is not what I mean when I say to "fit the model" (I want you to use lm() for that), but it's a nice way to visualize what's going on.
First, I’m going to show you another way to make the scatterplot we did before:
ggplot(cars04) +
  geom_point(aes(x = weight, y = msrp))

ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point()
Now, we can add on another geometric object, a line, with the same aesthetics.
ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
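If we've already stored the model as m1, one alternative sketch is to draw the fitted line directly from the stored coefficients with geom_abline(); geom_smooth(method = "lm") is fitting that same line for us behind the scenes.
ggplot(cars04, aes(x = weight, y = msrp)) +
  geom_point() +
  geom_abline(intercept = coef(m1)[1], slope = coef(m1)[2])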