Contents

  1. The linear model lm()

  2. Simple regression

    • continuous predictors

    • categorical predictors

  3. Multiple regression

    • main effect models

    • interaction models

  4. Residual plots

Linear Model

The linear regression model:

\[y_i=\beta_0+\beta_{1} x_{i1}+\beta_2x_{i2}+\dots+\beta_px_{ip}+\varepsilon_i, \ \ \ \ \ \ \varepsilon_i\sim N(0, \sigma^2)\] where

  • \(y_i\) is score of individual \(i\) on the numeric dependent variable \(Y\)

  • \(x_{i1}\) is the score of individual \(i\) on predictor \(X_1\)

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the slope of \(X_1\)

  • \(\varepsilon_{i}\) is the residual (prediction error)

  • \(\sigma^2\) is the variance of the residuals

The lm() function

lm(formula, data) # returns an object of class lm
formula              model
y ~ 1                intercept-only
y ~ x1               main effect of x1
y ~ x1 + x2          main effects of x1, x2
y ~ .                main effects of all predictors
y ~ . - x1           main effects of all predictors except x1
y ~ x1 + x2 + x1:x2  main effects + interaction between x1 and x2
y ~ x1*x2            idem
y ~ .^2              main effects + pairwise interactions between all predictors
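As a quick check of the shorthand notation (a sketch using the built-in mtcars data, with disp and wt standing in for x1 and x2):

```r
# y ~ x1*x2 expands to the same model as y ~ x1 + x2 + x1:x2
fit_long  <- lm(mpg ~ disp + wt + disp:wt, data = mtcars)
fit_short <- lm(mpg ~ disp * wt, data = mtcars)
all.equal(coef(fit_long), coef(fit_short))  # TRUE
```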

Simple regression

Continuous predictor

\[y_i=\hat{y}_i+\varepsilon_i\] where \[\hat{y}_i=\beta_0 + \beta_1x_i\]


  • \(\hat{y}_i\) is the fitted value of \(y_i\) for \(X=x_i\)

  • \(x_i\) is the score on the numeric predictor

  • \(\varepsilon_i\) is the residual

  • \(\beta_0\) is the fitted value for \(X = 0\)

  • \(\beta_1\) is the change in the fitted value for unit increase in \(X\)

mtcars data

The help file of the mtcars data provides the following information on the data:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

We will consider the following three variables of this data set:

  • mpg (numeric): miles per gallon, a measure of fuel efficiency.
  • disp (numeric): engine displacement, the overall volume in the engine as a function of cylinder circumference, depth and total number of cylinders. This metric is a good proxy for the total amount of power the engine can generate.
  • gear (numeric): number of forward gears in the transmission. Manual transmissions have either 4 or 5 forward gears; automatics either 3 or 4.

Information about the data set

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Linear model

lm(formula = mpg ~ disp, data = mtcars)
Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122  

For disp = 0, the predicted value of mpg is 29.6

For each unit increase in disp, the predicted value of mpg decreases by 0.041

(i.e. cars with larger, more powerful engines consume more fuel)

The lm object

Extracting information from an lm object called fit:

Function / Subsetting                   Output
summary(fit)                            summary of the model
coef(fit) / fit$coef                    coefficients
coef(summary(fit)) / summary(fit)$coef  coefficients, SEs, \(t\)-values, \(p\)-values
fitted(fit) / fit$fitted                fitted values
resid(fit) / fit$resid                  residuals
anova(fit)                              ANOVA table
plot(fit)                               residual plots

Extracting the coefficients

summary(lm(mpg ~ disp, mtcars))$coef %>% round(3) # %>% pipe requires magrittr/dplyr
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   29.600      1.230  24.070        0
disp          -0.041      0.005  -8.747        0

The intercept is significantly different from 0

The slope is significantly different from 0

Scatterplot

Display a scatter plot with linear regression line for the previous model.
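One way to do this in base R graphics (a sketch; axis labels and colour are illustrative choices):

```r
# Scatterplot of mpg against disp with the fitted regression line
fit <- lm(mpg ~ disp, data = mtcars)
plot(mpg ~ disp, data = mtcars,
     xlab = "Displacement (cu.in.)", ylab = "Miles per gallon")
abline(fit, col = "red")  # adds the line y = 29.600 - 0.041 * disp
```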

Categorical predictor

A predictor with \(k\) categories is converted into \(k-1\) dummy variables.

  • one dummy per category, except for the reference category

  • dummy has value \(1\) for corresponding category and \(0\) otherwise

  • the reference category has value \(0\) on all dummies


Reference category of \(z\) is \(a\)

Interpreting dummies

Model with two dummies \(zb\) and \(zc\):

\[\hat{y}=\beta_0+\beta_1zb+\beta_2zc\]


  • \(\beta_0\) is fitted value for \(z=a\) (reference category)

  • \(\beta_0+\beta_1\) is the fitted value for \(z=b\)

  • \(\beta_0+\beta_2\) is the fitted value for \(z=c\)
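The dummies are created automatically by R; model.matrix() shows them. A sketch with a small hypothetical factor z (levels a, b, c; a is the reference category):

```r
# Hypothetical example data, only to illustrate the dummy coding
z <- factor(c("a", "b", "c", "a", "b"))
y <- c(1, 2, 3, 1.5, 2.5)
model.matrix(y ~ z)  # columns: (Intercept), zb, zc
```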

Linear model with gear as factor

gear is a categorical variable with values 3, 4 and 5, but it is stored as numeric, so we convert it with factor().

lm(mpg ~ factor(gear), mtcars)
Call:
lm(formula = mpg ~ factor(gear), data = mtcars)

Coefficients:
  (Intercept)  factor(gear)4  factor(gear)5  
       16.107          8.427          5.273  
  • mean mpg for 3 gears is 16.107

  • mean mpg for 4 gears is 16.107 + 8.427

  • mean mpg for 5 gears is 16.107 + 5.273
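As a check, these coefficient sums reproduce the group means of mpg per number of gears:

```r
# Group means of mpg by gear equal the dummy-coefficient sums above
with(mtcars, tapply(mpg, gear, mean))
#        3        4        5
# 16.10667 24.53333 21.38000
```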

Regression lines

Two plots, each with reference category gear = 3

Model summary

summary(lm(formula = mpg ~ factor(gear), data = mtcars))

Call:
lm(formula = mpg ~ factor(gear), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7333 -3.2333 -0.9067  2.8483  9.3667 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     16.107      1.216  13.250 7.87e-14 ***
factor(gear)4    8.427      1.823   4.621 7.26e-05 ***
factor(gear)5    5.273      2.431   2.169   0.0384 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared:  0.4292,    Adjusted R-squared:  0.3898 
F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948

Model comparisons with anova()

Which model fits better:

  1. mpg ~ 1

  2. mpg ~ gear

  3. mpg ~ factor(gear)


Each subsequent model has one more parameter, so it is expected to fit better.

  • But is it significantly better?


We can check this with an \(F\)-test for the \(R^2\)-change

  • The function call is anova(fit1, fit2, fit3), with the models in order of increasing complexity

Which model fits best?

Test the \(R^2\)-change for the 3 models on the previous slide

  • which model fits best?
anova(lm(formula = mpg ~ 1, data = mtcars),
      lm(formula = mpg ~ gear, data = mtcars),
      lm(formula = mpg ~ factor(gear), data = mtcars))
Analysis of Variance Table

Model 1: mpg ~ 1
Model 2: mpg ~ gear
Model 3: mpg ~ factor(gear)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     31 1126.0                                
2     30  866.3  1    259.75 11.719 0.001864 **
3     29  642.8  1    223.49 10.083 0.003534 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 3 fits best.

Multiple regression

Main-effects model

No interactions in the model, e.g.

\[\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2\]

Interpretation

  • the slope of \(x_1\) does not depend on \(x_2\)

  • the slope of \(x_2\) does not depend on \(x_1\)

  • the slopes of \(x_1\) and \(x_2\) are corrected for the correlation \(r_{x_1,x_2}\)

Model with disp and gear

Slopes corrected for the correlation between disp and gear \((r = -0.56)\)

Slope for disp and gear in simple regression models

c(lm(formula = mpg ~ disp, data = mtcars)$coef[2], 
  lm(formula = mpg ~ gear, data = mtcars)$coef[2])
       disp        gear 
-0.04121512  3.92333333 

Slope for disp and gear in multiple regression model

lm(formula = mpg ~ disp + gear, data = mtcars)$coef[2:3]
       disp        gear 
-0.04084734  0.11120218 

Interaction model

An interaction effect is (literally) the product of two predictors:

\[\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2 + \beta_{12}x_1x_2\]

  • The main effects should always be included too!

Interpreting the interaction effect

  • The slope of \(x_1\) is \(\beta_1+\beta_{12}x_2\).

  • The slope of \(x_2\) is \(\beta_2+\beta_{12}x_1\).
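With a categorical moderator this means one slope of disp per gear group. A sketch using the interaction model fitted from the mtcars data:

```r
# Conditional slopes of disp, one per level of factor(gear)
fit <- lm(mpg ~ disp * factor(gear), data = mtcars)
b <- coef(fit)
b["disp"]                             # slope of disp when gear = 3 (reference)
b["disp"] + b["disp:factor(gear)4"]   # slope of disp when gear = 4
b["disp"] + b["disp:factor(gear)5"]   # slope of disp when gear = 5
```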

Model summary

summary(lm(formula = mpg ~ disp * factor(gear), data = mtcars)) 
Call:
lm(formula = mpg ~ disp * factor(gear), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5986 -1.5990 -0.0143  1.6329  4.9926 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        24.515566   2.462431   9.956 2.32e-10 ***
disp               -0.025770   0.007265  -3.547 0.001505 ** 
factor(gear)4      15.051963   3.558043   4.230 0.000256 ***
factor(gear)5       7.145380   3.535913   2.021 0.053711 .  
disp:factor(gear)4 -0.096442   0.021261  -4.536 0.000114 ***
disp:factor(gear)5 -0.025005   0.013320  -1.877 0.071742 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.579 on 26 degrees of freedom
Multiple R-squared:  0.8465,    Adjusted R-squared:  0.817 
F-statistic: 28.67 on 5 and 26 DF,  p-value: 8.452e-10

Regression plots
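The per-group regression lines of the interaction model can be drawn in base R as follows (a sketch; plotting symbols and colours are illustrative):

```r
# Scatterplot with one fitted regression line per gear group
fit <- lm(mpg ~ disp * factor(gear), data = mtcars)
plot(mpg ~ disp, data = mtcars, pch = 16, col = mtcars$gear - 2)
for (g in 3:5) {
  d <- range(mtcars$disp[mtcars$gear == g])          # line within observed range
  lines(d, predict(fit, data.frame(disp = d, gear = g)), col = g - 2)
}
```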

Compare the models

Compare and interpret the fit of the following models

  • intercept-only

  • model with disp predicting mpg

  • main effects model with disp and factor(gear) predicting mpg

  • interaction model with disp and factor(gear) predicting mpg

Results

anova(lm(formula = mpg ~ 1, data = mtcars),
      lm(formula = mpg ~ disp, mtcars),
      lm(formula = mpg ~ disp + factor(gear), mtcars),
      lm(formula = mpg ~ disp * factor(gear), mtcars))
Analysis of Variance Table

Model 1: mpg ~ 1
Model 2: mpg ~ disp
Model 3: mpg ~ disp + factor(gear)
Model 4: mpg ~ disp * factor(gear)
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1     31 1126.05                                   
2     30  317.16  1    808.89 121.657 2.651e-11 ***
3     28  317.01  2      0.15   0.011 0.9890201    
4     26  172.87  2    144.14  10.839 0.0003771 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given disp, adding the main effect of factor(gear) does not improve the fit \((p = 0.989)\), but adding the interaction does: the interaction model fits best.

Residual plots

Residuals vs Fitted

linearity

  • approximately straight red line at 0

Normal Q-Q

Normally distributed residuals

  • residuals approximately on diagonal

Scale-Location

Homoscedasticity (constant variance residuals)

  • approximately straight red line

Residuals vs Leverage

Influential data points

  • data points with a Cook’s Distance larger than 1
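All four diagnostic plots come from the plot() method for lm objects, e.g. for the interaction model:

```r
# The four residual plots described above, in one 2-by-2 grid
fit <- lm(mpg ~ disp * factor(gear), data = mtcars)
par(mfrow = c(2, 2))  # one panel per diagnostic plot
plot(fit)
```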