Contents

  1. The linear model lm()

  2. Simple regression

    • continuous predictors

    • categorical predictors

  3. Multiple regression

    • main effect models

    • interaction models

  4. Residual plots

Linear Model

The linear regression model:

\[y_i=\beta_0+\beta_{1} x_{i1}+\beta_2x_{i2}+\dots+\beta_px_{ip}+\varepsilon_i, \ \ \ \ \ \ \varepsilon_i\sim N(0, \sigma^2)\] where

  • \(y_i\) is score of individual \(i\) on the numeric dependent variable \(Y\)

  • \(x_{i1}\) is the score of individual \(i\) on predictor \(X_1\)

  • \(\beta_0\) is the intercept

  • \(\beta_1\) is the slope of \(X_1\)

  • \(\varepsilon_{i}\) is the residual (prediction error)

  • \(\sigma^2\) is the variance of the residuals

The lm() function

lm(formula, data) # returns an object of class lm
formula              model
y ~ 1                intercept-only
y ~ x1               main effect of x1
y ~ x1 + x2          main effects of x1, x2
y ~ .                main effects of all predictors
y ~ . - x1           main effects of all predictors except x1
y ~ x1 + x2 + x1:x2  main effects + interaction between x1 and x2
y ~ x1*x2            idem
y ~ .^2              main effects + pairwise interactions between all predictors
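As a quick check of the shorthand notation (a sketch using the built-in mtcars data, with disp and wt standing in for x1 and x2):

```r
# y ~ x1*x2 expands to the same model as y ~ x1 + x2 + x1:x2
fit_long  <- lm(mpg ~ disp + wt + disp:wt, data = mtcars)
fit_short <- lm(mpg ~ disp * wt, data = mtcars)
all.equal(coef(fit_long), coef(fit_short))  # TRUE
```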

Simple regression

Continuous predictor

\[y_i=\hat{y}_i+\varepsilon_i\] where \[\hat{y}_i=\beta_0 + \beta_1x_i\]


  • \(\hat{y}_i\) is the fitted value of \(y_i\) for \(X=x_i\)

  • \(x_i\) is the score on the numeric predictor

  • \(\varepsilon_i\) is the residual

  • \(\beta_0\) is the fitted value for \(X = 0\)

  • \(\beta_1\) is the change in the fitted value for unit increase in \(X\)

mtcars data

The help file of the mtcars data provides the following information on the data:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

We will consider the following three variables of this data set:

  • mpg (numeric): miles per gallon, a measure of fuel efficiency.
  • disp (numeric): engine displacement, the overall volume in the engine as a function of cylinder circumference, depth and total number of cylinders. This metric is a good proxy for the total amount of power the engine can generate.
  • gear (numeric): number of forward gears in the transmission. Manual transmissions have either 4 or 5 forward gears; automatics either 3 or 4.

Information about the data set

str(mtcars)
'data.frame':   32 obs. of  11 variables:
 $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
 $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
 $ disp: num  160 160 108 258 360 ...
 $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
 $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
 $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
 $ qsec: num  16.5 17 18.6 19.4 17 ...
 $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
 $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
 $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
 $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

Linear model

lm(formula = mpg ~ disp, data = mtcars)
Call:
lm(formula = mpg ~ disp, data = mtcars)

Coefficients:
(Intercept)         disp  
   29.59985     -0.04122  

For disp = 0, the predicted value of mpg is 29.6

For each unit increase in disp, the predicted value of mpg decreases by 0.041

(i.e. cars with larger, more powerful engines consume more fuel)

The lm object

Extracting information from an lm object called fit:

Function / Subsetting                   Output
summary(fit)                            summary of the model
coef(fit) / fit$coef                    coefficients
coef(summary(fit)) / summary(fit)$coef  coefficients, SEs, \(t\)-values, \(p\)-values
fitted(fit) / fit$fitted                fitted values
resid(fit) / fit$resid                  residuals
anova(fit)                              ANOVA table
plot(fit)                               residual plots

Extracting the coefficients

summary(lm(mpg ~ disp, mtcars))$coef %>% round(3) # %>% pipe requires magrittr/dplyr
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   29.600      1.230  24.070        0
disp          -0.041      0.005  -8.747        0

The intercept is significantly different from 0

The slope is significantly different from 0

Scatterplot

Display a scatter plot with linear regression line for the previous model.
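One way to do this in base R graphics (a sketch; axis labels and colour are illustrative choices):

```r
# Scatterplot of mpg against disp with the fitted regression line
fit <- lm(mpg ~ disp, data = mtcars)
plot(mpg ~ disp, data = mtcars,
     xlab = "Displacement (cu.in.)", ylab = "Miles per gallon")
abline(fit, col = "red")  # adds the line y = 29.600 - 0.041 * disp
```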

Categorical predictor

A predictor with \(k\) categories is converted into \(k-1\) dummy variables.

  • one dummy per category, except for the reference category

  • dummy has value \(1\) for corresponding category and \(0\) otherwise

  • the reference category has value \(0\) on all dummies


Reference category of \(z\) is \(a\)

Interpreting dummies

Model with two dummies \(zb\) and \(zc\):

\[\hat{y}=\beta_0+\beta_1zb+\beta_2zc\]


  • \(\beta_0\) is fitted value for \(z=a\) (reference category)

  • \(\beta_0+\beta_1\) is the fitted value for \(z=b\)

  • \(\beta_0+\beta_2\) is the fitted value for \(z=c\)
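The dummies are created automatically by R; model.matrix() shows them. A sketch with a small hypothetical factor z (levels a, b, c; a is the reference category):

```r
# Hypothetical example data, only to illustrate the dummy coding
z <- factor(c("a", "b", "c", "a", "b"))
y <- c(1, 2, 3, 1.5, 2.5)
model.matrix(y ~ z)  # columns: (Intercept), zb, zc
```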

Linear model with gear as factor

gear is a categorical variable with values 3, 4 and 5, but it is stored as numeric, so we convert it with factor().

lm(mpg ~ factor(gear), mtcars)
Call:
lm(formula = mpg ~ factor(gear), data = mtcars)

Coefficients:
  (Intercept)  factor(gear)4  factor(gear)5  
       16.107          8.427          5.273  
  • mean mpg for 3 gears is 16.107

  • mean mpg for 4 gears is 16.107 + 8.427

  • mean mpg for 5 gears is 16.107 + 5.273
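As a check, these coefficient sums reproduce the group means of mpg per number of gears:

```r
# Group means of mpg by gear equal the dummy-coefficient sums above
with(mtcars, tapply(mpg, gear, mean))
#        3        4        5
# 16.10667 24.53333 21.38000
```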

Regression lines

Two plots, each with reference category gear = 3

Model summary

summary(lm(formula = mpg ~ factor(gear), data = mtcars))

Call:
lm(formula = mpg ~ factor(gear), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-6.7333 -3.2333 -0.9067  2.8483  9.3667 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     16.107      1.216  13.250 7.87e-14 ***
factor(gear)4    8.427      1.823   4.621 7.26e-05 ***
factor(gear)5    5.273      2.431   2.169   0.0384 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.708 on 29 degrees of freedom
Multiple R-squared:  0.4292,    Adjusted R-squared:  0.3898 
F-statistic:  10.9 on 2 and 29 DF,  p-value: 0.0002948

Model comparisons with anova()

Which model fits better:

  1. mpg ~ 1

  2. mpg ~ gear

  3. mpg ~ factor(gear)


Each subsequent model has one more parameter, so it is expected to fit better.

  • But is it significantly better?


We can check this with an \(F\)-test for the \(R^2\)-change

  • The function call is anova(fit1, fit2, fit3), with the models in order of increasing complexity

Which model fits best?

Test the \(R^2\)-change for the 3 models on the previous slide

  • which model fits best?
anova(lm(formula = mpg ~ 1, data = mtcars),
      lm(formula = mpg ~ gear, data = mtcars),
      lm(formula = mpg ~ factor(gear), data = mtcars))
Analysis of Variance Table

Model 1: mpg ~ 1
Model 2: mpg ~ gear
Model 3: mpg ~ factor(gear)
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1     31 1126.0                                
2     30  866.3  1    259.75 11.719 0.001864 **
3     29  642.8  1    223.49 10.083 0.003534 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model 3 fits best.

Multiple regression

Main-effects model

No interactions in the model, e.g.

\[\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2\]

Interpretation

  • the slope of \(x_1\) does not depend on \(x_2\)

  • the slope of \(x_2\) does not depend on \(x_1\)

  • the slopes of \(x_1\) and \(x_2\) are corrected for the correlation \(r_{x_1,x_2}\)

Model with disp and gear

Slopes corrected for the correlation between disp and gear \((r = -0.56)\)

Slope for disp and gear in simple regression models

c(lm(formula = mpg ~ disp, data = mtcars)$coef[2], 
  lm(formula = mpg ~ gear, data = mtcars)$coef[2])
       disp        gear 
-0.04121512  3.92333333 

Slope for disp and gear in multiple regression model

lm(formula = mpg ~ disp + gear, data = mtcars)$coef[2:3]
       disp        gear 
-0.04084734  0.11120218 

Interaction model

An interaction effect is (literally) the product of two predictors:

\[\hat{y}=\beta_0+\beta_1x_1+\beta_2x_2 + \beta_{12}x_1x_2\]

  • The main effects should always be included too!

Interpreting the interaction effect

  • The slope of \(x_1\) is \(\beta_1+\beta_{12}x_2\).

  • The slope of \(x_2\) is \(\beta_2+\beta_{12}x_1\).
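With a categorical moderator this means one slope of disp per gear group. A sketch using the interaction model fitted from the mtcars data:

```r
# Conditional slopes of disp, one per level of factor(gear)
fit <- lm(mpg ~ disp * factor(gear), data = mtcars)
b <- coef(fit)
b["disp"]                             # slope of disp when gear = 3 (reference)
b["disp"] + b["disp:factor(gear)4"]   # slope of disp when gear = 4
b["disp"] + b["disp:factor(gear)5"]   # slope of disp when gear = 5
```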

Model summary

summary(lm(formula = mpg ~ disp * factor(gear), data = mtcars)) 
Call:
lm(formula = mpg ~ disp * factor(gear), data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5986 -1.5990 -0.0143  1.6329  4.9926 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        24.515566   2.462431   9.956 2.32e-10 ***
disp               -0.025770   0.007265  -3.547 0.001505 ** 
factor(gear)4      15.051963   3.558043   4.230 0.000256 ***
factor(gear)5       7.145380   3.535913   2.021 0.053711 .  
disp:factor(gear)4 -0.096442   0.021261  -4.536 0.000114 ***
disp:factor(gear)5 -0.025005   0.013320  -1.877 0.071742 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.579 on 26 degrees of freedom
Multiple R-squared:  0.8465,    Adjusted R-squared:  0.817 
F-statistic: 28.67 on 5 and 26 DF,  p-value: 8.452e-10

Regression plots
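The per-group regression lines of the interaction model can be drawn in base R as follows (a sketch; plotting symbols and colours are illustrative):

```r
# Scatterplot with one fitted regression line per gear group
fit <- lm(mpg ~ disp * factor(gear), data = mtcars)
plot(mpg ~ disp, data = mtcars, pch = 16, col = mtcars$gear - 2)
for (g in 3:5) {
  d <- range(mtcars$disp[mtcars$gear == g])          # line within observed range
  lines(d, predict(fit, data.frame(disp = d, gear = g)), col = g - 2)
}
```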

Compare the models

Compare and interpret the fit of the following models

  • intercept-only

  • model with disp predicting mpg

  • main effects model with disp and factor(gear) predicting mpg

  • interaction model with disp and factor(gear) predicting mpg

Results

anova(lm(formula = mpg ~ 1, data = mtcars),
      lm(formula = mpg ~ disp, mtcars),
      lm(formula = mpg ~ disp + factor(gear), mtcars),
      lm(formula = mpg ~ disp * factor(gear), mtcars))
Analysis of Variance Table

Model 1: mpg ~ 1
Model 2: mpg ~ disp
Model 3: mpg ~ disp + factor(gear)
Model 4: mpg ~ disp * factor(gear)
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1     31 1126.05                                   
2     30  317.16  1    808.89 121.657 2.651e-11 ***
3     28  317.01  2      0.15   0.011 0.9890201    
4     26  172.87  2    144.14  10.839 0.0003771 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Given disp, adding the main effect of factor(gear) does not improve the fit \((p = 0.989)\), but adding the interaction does: the interaction model fits best.

Residual plots

Residuals vs Fitted

linearity

  • approximately straight red line at 0

Normal Q-Q

Normally distributed residuals

  • residuals approximately on diagonal

Scale-Location

Homoscedasticity (constant variance residuals)

  • approximately straight red line

Residuals vs Leverage

Influential data points

  • data points with a Cook’s Distance larger than 1
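All four diagnostic plots come from the plot() method for lm objects, e.g. for the interaction model:

```r
# The four residual plots described above, in one 2-by-2 grid
fit <- lm(mpg ~ disp * factor(gear), data = mtcars)
par(mfrow = c(2, 2))  # one panel per diagnostic plot
plot(fit)
```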