Pathological examples of model condition violations

Because most real models are hard to assess, we want to examine some pathological examples of the LINE conditions for regression being violated.

For all these examples, we will generate simulated data to show the particular conditions being violated.

You’ll notice that even as we’re trying to simulate data to violate one condition, the others start to look bad, too. This is very common in the real world, as well!

Good Model

First, a positive example! Let’s make some good data.

n = 10000
beta0 = 10
beta1 = 3
x = runif(n)
e = rnorm(n)
ds = data.frame(y = beta0 + beta1 * x + e)

Now, we can look at the relationship and check the conditions.

ggplot(ds, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm, se=FALSE)

mod = lm(y ~ x, data=ds)
plot(mod, which=1)

plot(mod, which=2)

ggplot(ds, aes(x=mod$residuals)) + geom_histogram()

Linearity

ds = data.frame(y = beta0 + beta1 * x + 15*x^2 + e)
ggplot(ds, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm, se=FALSE)

mod = lm(y ~ x, data=ds)
plot(mod, which=1)

plot(mod, which=2)

ggplot(ds, aes(x=mod$residuals)) + geom_histogram()

Constant Variance

ds = data.frame(y = beta0 + beta1 * x + e*x)
ggplot(ds, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm, se=FALSE)

mod = lm(y ~ x, data=ds)
plot(mod, which=1)

plot(mod, which=2)

ggplot(ds, aes(x=mod$residuals)) + geom_histogram()

Normality

ds = data.frame(y = beta0 + beta1 * x + rbeta(n, shape1 = 2, shape2 = 5))
ggplot(ds, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm, se=FALSE)

mod = lm(y ~ x, data=ds)
plot(mod, which=1)

plot(mod, which=2)

ggplot(ds, aes(x=mod$residuals)) + geom_histogram()

How much is too much?

How loosely should we allow the conditions for regression to be violated? To get a feel for this, consider the following simulation. Note that, by construction, the data are generated from the model: \[ Y = 10 + 3 \cdot X + \epsilon, \qquad \epsilon \sim N(0, 1) \]

n = 20
ds = data.frame(x = rnorm(n))
ds = ds %>%
  mutate(y = 10 + 3 * x + rnorm(n))

ggplot(ds, aes(x=x, y=y)) + geom_point() + stat_smooth(method=lm, se=FALSE)

mod1 = lm(y ~ x, data=ds)
plot(mod1, which=1)

plot(mod1, which=2)

ggplot(ds, aes(x=mod1$residuals)) + geom_histogram()

Run this chunk several times, with several different choices of \(n\). What do you notice?