library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)
1 Animals
In this exercise you will work with the Animals
data
from the package MASS
. These data include two variables
measuring the brain and body weight of various animals. The central
question in this exercise is: “Can we predict the weight of an animal’s
brain from the weight of the animal’s body? The exercise shows that with
some simple data transformations and adding a predictor we can go from
0% variance explained to almost 100%!
1.1 Linear model
- Check the help file for
Animals
, and display the row names to get an idea of what kinds of animals are in the data.
rownames(Animals)
## [1] "Mountain beaver" "Cow" "Grey wolf" "Goat"
## [5] "Guinea pig" "Dipliodocus" "Asian elephant" "Donkey"
## [9] "Horse" "Potar monkey" "Cat" "Giraffe"
## [13] "Gorilla" "Human" "African elephant" "Triceratops"
## [17] "Rhesus monkey" "Kangaroo" "Golden hamster" "Mouse"
## [21] "Rabbit" "Sheep" "Jaguar" "Chimpanzee"
## [25] "Rat" "Brachiosaurus" "Mole" "Pig"
- Let’s start with with some data visualizations. Display the boxplots
showing the distributions of
brain
andbody
.
grid.arrange(
ggplot(Animals) +
geom_boxplot(aes(y = brain)),
ggplot(Animals) +
geom_boxplot(aes(y = body)),
nrow = 1)
- Display a scatter plot with
body
on the x-axis andbrain
on the y-axis, and add a linear regression line including the confidence band. Interpret the plot.
ggplot(Animals, aes(x = body, y = brain)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
- Let’s see how the linear regression model performs. Fit the linear
model \(m_0\) predicting
brain
frombody
; display its summary, and interpret the parameter estimates and the R-squared.
<- lm(formula = brain ~ body, data = Animals)
m0 summary(m0)
##
## Call:
## lm(formula = brain ~ body, data = Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -576.0 -554.1 -438.1 -156.3 5138.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.764e+02 2.659e+02 2.168 0.0395 *
## body -4.326e-04 1.589e-02 -0.027 0.9785
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1360 on 26 degrees of freedom
## Multiple R-squared: 2.853e-05, Adjusted R-squared: -0.03843
## F-statistic: 0.0007417 on 1 and 26 DF, p-value: 0.9785
- Display the diagnostic residual plots for the previous model, and interpret the results.
par(mfcol = c(2, 2))
plot(m0)
1.2 Data transformations
Since the variables brain
and body
are
substantially skewed, a data transformation may improve the fit of the
model.
- Apply transformations to normalize the distributions of the
variables
body
andbrain
, and check the result with boxplots.
grid.arrange(
ggplot(Animals) +
geom_boxplot(aes(y = log(brain))),
ggplot(Animals) +
geom_boxplot(aes(y = log(body))),
nrow = 1)
- Display a scatter plot of the transformed versions of
brain
andbody
, and add a linear regression line.
ggplot(Animals, aes(x = log(body), y = log(brain))) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
- Fit the linear model
m1
with the transformedbody
predicting the transformedbrain
; display its summary, and interpret the parameter estimates and the R-squared.
<- lm(formula = log(brain) ~ log(body), data = Animals)
m1 summary(m1)
##
## Call:
## lm(formula = log(brain) ~ log(body), data = Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2890 -0.6763 0.3316 0.8646 2.5835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.55490 0.41314 6.184 1.53e-06 ***
## log(body) 0.49599 0.07817 6.345 1.02e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.532 on 26 degrees of freedom
## Multiple R-squared: 0.6076, Adjusted R-squared: 0.5925
## F-statistic: 40.26 on 1 and 26 DF, p-value: 1.017e-06
- Display the diagnostic plots for
m1
and interpret. Are the model assumptions fulfilled?
par(mfcol = c(2, 2))
plot(lm(formula = log(brain) ~ log(body), data = Animals))
1.3 Adding a predictor
Additional predictors can help to satisfy the assumptions. For the data at hand, the residuals plots show three outliers that have one thing at common; they are prehistoric species. Let’s see what happens if we add a predictor to the model that distinguishes between extinct and other species.
- Make the factor
species
with levels “extinct” for the prehistoric animals and “other” for the others, and add this factor to the data frameAnimals
.
<- mutate(Animals,
Animals species = recode_factor(rownames(Animals),
Brachiosaurus = "extinct",
Triceratops = "extinct",
Dipliodocus = "extinct",
.default = "other")
)
OR
<- mutate(Animals,
Animals species = case_when(
rownames(Animals) %in% c("Brachiosaurus",
"Triceratops",
"Dipliodocus") ~ "extinct",
TRUE ~ "other")
)
- Fit the model
m2
by adding the predictorspecies
to modelm1
, display its summary, and interpret the parameter estimates and the R-squared.
<- lm(formula = log(brain) ~ log(body) + species, data = Animals)
m2 summary(m2)
##
## Call:
## lm(formula = log(brain) ~ log(body) + species, data = Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9042 -0.4335 -0.1240 0.2410 1.9344
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.05771 0.60118 -5.086 2.98e-05 ***
## log(body) 0.74855 0.04428 16.903 3.43e-15 ***
## speciesother 5.21935 0.53016 9.845 4.39e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7073 on 25 degrees of freedom
## Multiple R-squared: 0.9195, Adjusted R-squared: 0.9131
## F-statistic: 142.9 on 2 and 25 DF, p-value: 2.088e-14
- Display and interpret the diagnostic plots for
m2
.
par(mfcol = c(2, 2))
plot(m2)
The diagnostic plots show outliers that have a higher (logarithm of)
brain
than predicted by the model. These outliers belong to
the class of species know as primates. This result suggests that the
predictor species
could be extended with the category
primates
.
- Add the level “primate” to the variable
species
(check which of the animals are primates).
<- mutate(Animals, species = recode_factor(rownames(Animals),
Animals 'Brachiosaurus' = "extinct",
'Triceratops' = "extinct",
'Dipliodocus' = "extinct",
'Human' = "primate",
'Chimpanzee' = "primate",
'Gorilla' = "primate",
'Rhesus monkey' = "primate",
'Potar monkey' = "primate",
.default = "other")
)
OR
<- mutate(Animals,
Animals species = case_when(rownames(Animals) %in% c('Human',
'Chimpanzee',
'Gorilla',
'Rhesus monkey',
'Potar monkey') ~ "primate",
TRUE ~ species)
)
- Fit the model
m3
with the modified predictorspecies
; display its summary, and interpret the parameter estimates and the R-squared.
<- lm(formula = log(brain) ~ log(body) + species, data = Animals)
m3 summary(m3)
##
## Call:
## lm(formula = log(brain) ~ log(body) + species, data = Animals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16459 -0.29307 -0.06216 0.29418 0.90147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.9303 0.4148 -7.064 2.65e-07 ***
## log(body) 0.7358 0.0306 24.046 < 2e-16 ***
## speciesprimate 6.1776 0.4067 15.191 8.24e-14 ***
## speciesother 4.8689 0.3710 13.124 1.92e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4872 on 24 degrees of freedom
## Multiple R-squared: 0.9633, Adjusted R-squared: 0.9588
## F-statistic: 210.3 on 3 and 24 DF, p-value: < 2.2e-16
- Display and interpret the diagnostic plots for
m3
. What do the residual plots say about the brain weight of the Gorilla?
par(mfcol = c(2, 2))
plot(m3)
- Test the fit of the models
m1
tom3
with theanova()
function.
anova(m1, m2, m3)
## Analysis of Variance Table
##
## Model 1: log(brain) ~ log(body)
## Model 2: log(brain) ~ log(body) + species
## Model 3: log(brain) ~ log(body) + species
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 60.988
## 2 25 12.505 1 48.483 204.264 3.096e-13 ***
## 3 24 5.696 1 6.809 28.687 1.692e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
End of practical