In this practical, we will compare the estimates of the linear and the generalized linear model when the dependent variable is a dichotomous variable. The data set for this practical is `lungcap` from the package `GLMsData`. To load the data into the workspace you need to use the function `data()`.
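For example, loading the data might look like this (assuming the `GLMsData` package has been installed from CRAN):

```r
# install.packages("GLMsData")  # uncomment if the package is not yet installed
library(GLMsData)  # package that provides the lungcap data set
data(lungcap)      # copies the data frame into the workspace
head(lungcap)      # first few rows: Age, FEV, Ht, Gender, Smoke
```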
1 Lungcap
The data set `lungcap` contains data on four characteristics of smokers and non-smokers (check its help page for details). In this exercise, we will use these characteristics to predict the probability of being a smoker.
- Display the summary of the data, and check the variables for skew.
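A minimal way to obtain the summary and to eyeball skew (assuming `lungcap` is loaded):

```r
summary(lungcap)  # the factor Gender shows counts instead of quantiles
# quick skew check: a mean noticeably above the median suggests right skew
with(lungcap, mean(FEV) - median(FEV))
```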
Age FEV Ht Gender Smoke
Min. : 3.000 Min. :0.791 Min. :46.00 F:318 Min. :0.00000
1st Qu.: 8.000 1st Qu.:1.981 1st Qu.:57.00 M:336 1st Qu.:0.00000
Median :10.000 Median :2.547 Median :61.50 Median :0.00000
Mean : 9.931 Mean :2.637 Mean :61.14 Mean :0.09939
3rd Qu.:12.000 3rd Qu.:3.119 3rd Qu.:65.50 3rd Qu.:0.00000
Max. :19.000 Max. :5.793 Max. :74.00 Max. :1.00000
- Display a frequency table for the variable `Smoke` to see how many smokers and non-smokers there are in the data.
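One way to produce the table (assuming the data are loaded):

```r
table(lungcap$Smoke)  # 0 = non-smoker, 1 = smoker (see the help page)
```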
Smoke
0 1
589 65
1.1 Linear model
We will start again by treating the variable `Smoke` as continuous with a distribution from the Gaussian family.
- Fit the generalized linear model `Smoke ~ Age` with the argument `family = "gaussian"`, and display and interpret its summary.
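The fit could be obtained as follows (the object name `linear` is our choice, not prescribed by the exercise):

```r
linear <- glm(Smoke ~ Age, family = "gaussian", data = lungcap)
summary(linear)  # equivalent to lm(Smoke ~ Age) apart from the output format
```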
Call:
glm(formula = Smoke ~ Age, family = "gaussian", data = lungcap)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.307544 0.037617 -8.176 1.54e-15 ***
Age 0.040975 0.003631 11.286 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 0.07511225)
Null deviance: 58.540 on 653 degrees of freedom
Residual deviance: 48.973 on 652 degrees of freedom
AIC: 166.91
Number of Fisher Scoring iterations: 2
- Display a scatterplot with `Age` on the x-axis and `Smoke` on the y-axis, and add the linear regression line.
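A ggplot2 sketch of the requested plot (assuming the ggplot2 package is available):

```r
library(ggplot2)
ggplot(lungcap, aes(Age, Smoke)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +  # linear fit
  theme_minimal()
```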
- Display and interpret the diagnostic residual plots for the linear model.
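The base-R diagnostic plots can be shown with (assuming the fitted object is called `linear`, as above):

```r
par(mfrow = c(2, 2))  # arrange the four diagnostic panels in one window
plot(linear)          # residuals vs fitted, Q-Q, scale-location, leverage
```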
1.2 Logistic regression model
We will now treat Smoke
as a dichotomous variable with
glm()
.
- Fit the logistic regression model `Smoke ~ Age`, and display and interpret its summary.
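A sketch of the call (the object name `logistic` matches the plotting code used later in this practical):

```r
logistic <- glm(Smoke ~ Age, family = "binomial", data = lungcap)
summary(logistic)
```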
Call:
glm(formula = Smoke ~ Age, family = "binomial", data = lungcap)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -7.74391 0.70890 -10.924 <2e-16 ***
Age 0.48364 0.05513 8.773 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 423.45 on 653 degrees of freedom
Residual deviance: 318.56 on 652 degrees of freedom
AIC: 322.56
Number of Fisher Scoring iterations: 6
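Since the coefficients are on the log-odds scale, exponentiating them helps interpretation. For example:

```r
exp(0.48364)  # each extra year of age multiplies the odds of smoking by about 1.62
```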
- Display a scatterplot with `Age` on the x-axis and `Smoke` on the y-axis. Add a line representing the probability of smoking predicted by the logistic regression model, and compare (the range of) the predicted values to those of the linear model.
# 'logistic' is the fitted binomial glm(Smoke ~ Age)
lungcap %>%
  mutate(pred = predict(logistic, type = "response")) %>%
  ggplot() +
  geom_point(aes(Age, Smoke)) +
  geom_line(aes(Age, pred), col = "red") +
  theme_minimal()
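To compare the ranges of the two models' predictions (assuming `linear` and `logistic` are the two fitted objects):

```r
range(predict(linear))                       # linear predictions can fall below 0
range(predict(logistic, type = "response"))  # probabilities stay within (0, 1)
```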
1.3 Model selection
- Find the most parsimonious main-effects logistic regression model when all variables in `lungcap` are allowed to enter the model as predictors.
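One way to run the selection shown below, starting from the Age-only model (assumptions: the fitted object is `logistic` and selection is stepwise by AIC):

```r
step(logistic,
     scope = list(lower = ~ 1, upper = ~ Age + FEV + Ht + Gender),
     direction = "both")
```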
Start: AIC=322.56
Smoke ~ Age
Df Deviance AIC
+ Gender 1 311.68 317.68
<none> 318.56 322.56
+ Ht 1 317.06 323.06
+ FEV 1 317.43 323.43
- Age 1 423.45 425.45
Step: AIC=317.68
Smoke ~ Age + Gender
Df Deviance AIC
+ Ht 1 303.03 311.03
<none> 311.68 317.68
+ FEV 1 311.61 319.61
- Gender 1 318.56 322.56
- Age 1 419.69 423.69
Step: AIC=311.03
Smoke ~ Age + Gender + Ht
Df Deviance AIC
+ FEV 1 300.02 310.02
<none> 303.03 311.03
- Ht 1 311.68 317.68
- Gender 1 317.06 323.06
- Age 1 341.01 347.01
Step: AIC=310.02
Smoke ~ Age + Gender + Ht + FEV
Df Deviance AIC
<none> 300.02 310.02
- FEV 1 303.03 311.03
- Gender 1 309.82 317.82
- Ht 1 311.61 319.61
- Age 1 340.95 348.95
Call: glm(formula = Smoke ~ Age + Gender + Ht + FEV, family = "binomial",
data = lungcap)
Coefficients:
(Intercept) Age GenderM Ht FEV
-18.6141 0.4308 -1.1652 0.2122 -0.5452
Degrees of Freedom: 653 Total (i.e. Null); 649 Residual
Null Deviance: 423.4
Residual Deviance: 300 AIC: 310
- Find the most parsimonious logistic regression model allowing all pairwise interactions between the predictors to enter the model.
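The same stepwise search, but with all pairwise interactions in the upper scope (the `^2` formula notation expands to all main effects plus all two-way interactions):

```r
step(logistic,
     scope = list(lower = ~ 1, upper = ~ (Age + FEV + Ht + Gender)^2),
     direction = "both")
```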
Start: AIC=322.56
Smoke ~ Age
Df Deviance AIC
+ Gender 1 311.68 317.68
<none> 318.56 322.56
+ Ht 1 317.06 323.06
+ FEV 1 317.43 323.43
- Age 1 423.45 425.45
Step: AIC=317.68
Smoke ~ Age + Gender
Df Deviance AIC
+ Ht 1 303.03 311.03
<none> 311.68 317.68
+ FEV 1 311.61 319.61
+ Age:Gender 1 311.64 319.64
- Gender 1 318.56 322.56
- Age 1 419.69 423.69
Step: AIC=311.03
Smoke ~ Age + Gender + Ht
Df Deviance AIC
+ Age:Ht 1 293.67 303.67
+ Ht:Gender 1 298.93 308.93
+ FEV 1 300.02 310.02
<none> 303.03 311.03
+ Age:Gender 1 302.58 312.58
- Ht 1 311.68 317.68
- Gender 1 317.06 323.06
- Age 1 341.01 347.01
Step: AIC=303.67
Smoke ~ Age + Gender + Ht + Age:Ht
Df Deviance AIC
+ Age:Gender 1 290.58 302.58
+ Ht:Gender 1 291.30 303.30
<none> 293.67 303.67
+ FEV 1 291.91 303.91
- Gender 1 301.46 309.46
- Age:Ht 1 303.04 311.04
Step: AIC=302.58
Smoke ~ Age + Gender + Ht + Age:Ht + Age:Gender
Df Deviance AIC
+ Ht:Gender 1 286.85 300.85
+ FEV 1 287.88 301.88
<none> 290.58 302.58
- Age:Gender 1 293.67 303.67
- Age:Ht 1 302.57 312.57
Step: AIC=300.85
Smoke ~ Age + Gender + Ht + Age:Ht + Age:Gender + Gender:Ht
Df Deviance AIC
<none> 286.85 300.85
+ FEV 1 285.23 301.23
- Gender:Ht 1 290.58 302.58
- Age:Gender 1 291.30 303.30
- Age:Ht 1 298.88 310.88
Call: glm(formula = Smoke ~ Age + Gender + Ht + Age:Ht + Age:Gender +
Gender:Ht, family = "binomial", data = lungcap)
Coefficients:
(Intercept) Age GenderM Ht Age:Ht Age:GenderM
-83.28638 5.34130 9.53759 1.20813 -0.07744 0.35773
GenderM:Ht
-0.22838
Degrees of Freedom: 653 Total (i.e. Null); 647 Residual
Null Deviance: 423.4
Residual Deviance: 286.9 AIC: 300.9
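The two final models can also be compared directly by AIC (the object names `main_fit` and `int_fit` are hypothetical; a lower AIC indicates the preferred model in this sense):

```r
# main_fit and int_fit: hypothetical names for the two step() results above
AIC(main_fit, int_fit)
```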
- Which model is the most parsimonious: the one with or the one without the pairwise interactions?
End of practical