In this practical, we will compare the estimates of the linear and the generalized linear model when the dependent variable is dichotomous. The data set for this practical is lungcap from the package GLMsData. To load the data into the workspace, first attach the package with library() and then use the function data().

library(GLMsData)  # contains the lungcap data set
library(ggplot2)   # used for the plots below
library(dplyr)     # used for mutate() and the pipe
data(lungcap)

1 Lungcap

The data set lungcap contains data on four characteristics of smokers and non-smokers (check its help page for details). In this exercise, we will use these characteristics to predict the probability of being a smoker.

  1. Display the summary of the data, and check the variables for skew.
summary(lungcap)
      Age              FEV              Ht        Gender      Smoke        
 Min.   : 3.000   Min.   :0.791   Min.   :46.00   F:318   Min.   :0.00000  
 1st Qu.: 8.000   1st Qu.:1.981   1st Qu.:57.00   M:336   1st Qu.:0.00000  
 Median :10.000   Median :2.547   Median :61.50           Median :0.00000  
 Mean   : 9.931   Mean   :2.637   Mean   :61.14           Mean   :0.09939  
 3rd Qu.:12.000   3rd Qu.:3.119   3rd Qu.:65.50           3rd Qu.:0.00000  
 Max.   :19.000   Max.   :5.793   Max.   :74.00           Max.   :1.00000  
  2. Display a frequency table for the variable Smoke to see how many smokers and non-smokers there are in the data.
xtabs(~ Smoke, lungcap)
Smoke
  0   1 
589  65 
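The table can also be expressed as proportions, which makes the imbalance easier to read; a minimal sketch using base R:

```r
# Proportion of smokers vs non-smokers: roughly 10% of the sample smokes,
# matching the mean of 0.09939 reported by summary() above
prop.table(xtabs(~ Smoke, lungcap))
```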

1.1 Linear model

As in the previous practical, we will start by treating the variable Smoke as a continuous variable with a distribution from the Gaussian family.

  1. Fit the generalized linear model Smoke ~ Age with the argument family = "gaussian", and display and interpret its summary.
linear <- glm(Smoke ~ Age, "gaussian", lungcap)
summary(linear)

Call:
glm(formula = Smoke ~ Age, family = "gaussian", data = lungcap)

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.307544   0.037617  -8.176 1.54e-15 ***
Age          0.040975   0.003631  11.286  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 0.07511225)

    Null deviance: 58.540  on 653  degrees of freedom
Residual deviance: 48.973  on 652  degrees of freedom
AIC: 166.91

Number of Fisher Scoring iterations: 2
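One way to see why the Gaussian model is awkward for a 0/1 outcome is to evaluate its fitted line at the extremes of Age: with the coefficients above, the prediction at Age = 3 is -0.308 + 0.041 * 3, which is negative. A sketch, assuming the fitted object linear from above:

```r
# Predicted "probabilities" at the youngest and oldest ages in the data;
# the prediction at Age = 3 falls below 0, outside the valid [0, 1] range
predict(linear, newdata = data.frame(Age = c(3, 19)))
```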
  2. Display a scatterplot with Age on the x-axis and Smoke on the y-axis, and add the linear regression line.
ggplot(lungcap, aes(Age, Smoke)) + 
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

  3. Display and interpret the diagnostic residual plots for the linear model.
par(mfrow = c(2, 2))
plot(linear)

1.2 Logistic regression model

We will now treat Smoke as a dichotomous variable with glm().

  1. Fit the logistic regression model Smoke ~ Age, and display and interpret its summary.
logistic <- glm(Smoke ~ Age, "binomial", lungcap)
summary(logistic)

Call:
glm(formula = Smoke ~ Age, family = "binomial", data = lungcap)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -7.74391    0.70890 -10.924   <2e-16 ***
Age          0.48364    0.05513   8.773   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 423.45  on 653  degrees of freedom
Residual deviance: 318.56  on 652  degrees of freedom
AIC: 322.56

Number of Fisher Scoring iterations: 6
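The logistic coefficients are on the log-odds scale, so exponentiating them gives odds ratios; for Age, exp(0.48364) is roughly 1.62. A sketch, assuming the fitted object logistic from above:

```r
# Odds ratios: the odds of being a smoker increase by about 62%
# with each additional year of age
exp(coef(logistic))
```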
  2. Display a scatterplot with Age on the x-axis and Smoke on the y-axis. Add a line representing the probability of smoking predicted by the logistic regression model, and compare (the range of) the predicted values to those of the linear model.
lungcap %>% 
  mutate(pred = predict(logistic, type = "response")) %>% 
  ggplot() + 
  geom_point(aes(Age, Smoke)) +
  geom_line(aes(Age, pred), col = "red") +
  theme_minimal()
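The comparison of predicted ranges can also be made numerically; a minimal sketch, assuming the objects linear and logistic from the sections above:

```r
# Linear predictions can fall outside [0, 1];
# logistic predictions on the response scale are bounded within (0, 1)
range(predict(linear))
range(predict(logistic, type = "response"))
```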

1.3 Model selection

  1. Find the most parsimonious main-effects logistic regression model when all variables in lungcap are allowed to enter the model as predictors.
step(logistic, scope = Smoke ~ Age + FEV + Ht + Gender)
Start:  AIC=322.56
Smoke ~ Age

         Df Deviance    AIC
+ Gender  1   311.68 317.68
<none>        318.56 322.56
+ Ht      1   317.06 323.06
+ FEV     1   317.43 323.43
- Age     1   423.45 425.45

Step:  AIC=317.68
Smoke ~ Age + Gender

         Df Deviance    AIC
+ Ht      1   303.03 311.03
<none>        311.68 317.68
+ FEV     1   311.61 319.61
- Gender  1   318.56 322.56
- Age     1   419.69 423.69

Step:  AIC=311.03
Smoke ~ Age + Gender + Ht

         Df Deviance    AIC
+ FEV     1   300.02 310.02
<none>        303.03 311.03
- Ht      1   311.68 317.68
- Gender  1   317.06 323.06
- Age     1   341.01 347.01

Step:  AIC=310.02
Smoke ~ Age + Gender + Ht + FEV

         Df Deviance    AIC
<none>        300.02 310.02
- FEV     1   303.03 311.03
- Gender  1   309.82 317.82
- Ht      1   311.61 319.61
- Age     1   340.95 348.95

Call:  glm(formula = Smoke ~ Age + Gender + Ht + FEV, family = "binomial", 
    data = lungcap)

Coefficients:
(Intercept)          Age      GenderM           Ht          FEV  
   -18.6141       0.4308      -1.1652       0.2122      -0.5452  

Degrees of Freedom: 653 Total (i.e. Null);  649 Residual
Null Deviance:      423.4 
Residual Deviance: 300  AIC: 310
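Note that step() returns the final fitted model, so it can be stored for later use instead of only printed; a sketch (the name main_effects is ours):

```r
# Store the selected main-effects model; trace = 0 suppresses
# the step-by-step output shown above
main_effects <- step(logistic, scope = Smoke ~ Age + FEV + Ht + Gender,
                     trace = 0)
summary(main_effects)
```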
  2. Find the most parsimonious logistic regression model allowing all pairwise interactions between the predictors to enter the model.
step(logistic, scope = Smoke ~ (Age + FEV + Ht + Gender)^2)
Start:  AIC=322.56
Smoke ~ Age

         Df Deviance    AIC
+ Gender  1   311.68 317.68
<none>        318.56 322.56
+ Ht      1   317.06 323.06
+ FEV     1   317.43 323.43
- Age     1   423.45 425.45

Step:  AIC=317.68
Smoke ~ Age + Gender

             Df Deviance    AIC
+ Ht          1   303.03 311.03
<none>            311.68 317.68
+ FEV         1   311.61 319.61
+ Age:Gender  1   311.64 319.64
- Gender      1   318.56 322.56
- Age         1   419.69 423.69

Step:  AIC=311.03
Smoke ~ Age + Gender + Ht

             Df Deviance    AIC
+ Age:Ht      1   293.67 303.67
+ Ht:Gender   1   298.93 308.93
+ FEV         1   300.02 310.02
<none>            303.03 311.03
+ Age:Gender  1   302.58 312.58
- Ht          1   311.68 317.68
- Gender      1   317.06 323.06
- Age         1   341.01 347.01

Step:  AIC=303.67
Smoke ~ Age + Gender + Ht + Age:Ht

             Df Deviance    AIC
+ Age:Gender  1   290.58 302.58
+ Ht:Gender   1   291.30 303.30
<none>            293.67 303.67
+ FEV         1   291.91 303.91
- Gender      1   301.46 309.46
- Age:Ht      1   303.04 311.04

Step:  AIC=302.58
Smoke ~ Age + Gender + Ht + Age:Ht + Age:Gender

             Df Deviance    AIC
+ Ht:Gender   1   286.85 300.85
+ FEV         1   287.88 301.88
<none>            290.58 302.58
- Age:Gender  1   293.67 303.67
- Age:Ht      1   302.57 312.57

Step:  AIC=300.85
Smoke ~ Age + Gender + Ht + Age:Ht + Age:Gender + Gender:Ht

             Df Deviance    AIC
<none>            286.85 300.85
+ FEV         1   285.23 301.23
- Gender:Ht   1   290.58 302.58
- Age:Gender  1   291.30 303.30
- Age:Ht      1   298.88 310.88

Call:  glm(formula = Smoke ~ Age + Gender + Ht + Age:Ht + Age:Gender + 
    Gender:Ht, family = "binomial", data = lungcap)

Coefficients:
(Intercept)          Age      GenderM           Ht       Age:Ht  Age:GenderM  
  -83.28638      5.34130      9.53759      1.20813     -0.07744      0.35773  
 GenderM:Ht  
   -0.22838  

Degrees of Freedom: 653 Total (i.e. Null);  647 Residual
Null Deviance:      423.4 
Residual Deviance: 286.9    AIC: 300.9
  3. Which model is the most parsimonious: the one with or the one without the pairwise interactions?
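The AIC values reported above (310.02 for the main-effects model, 300.85 for the model with interactions) can also be compared directly; a sketch, assuming both final models have been stored (the names main_effects and interactions are ours):

```r
# Lower AIC indicates the more parsimonious fit given the data:
# here the model with the pairwise interactions wins
AIC(main_effects, interactions)
```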

End of practical