library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)1 Animals
In this exercise you will work with the Animals data
from the package MASS. These data include two variables
measuring the brain and body weight of various animals. The central
question in this exercise is: “Can we predict the weight of an animal’s
brain from the weight of the animal’s body? The exercise shows that with
some simple data transformations and adding a predictor we can go from
0% variance explained to almost 100%!
1.1 Linear model
- Check the help file for Animals, and display the row names to get an idea of what kinds of animals are in the data.
rownames(Animals)##  [1] "Mountain beaver"  "Cow"              "Grey wolf"        "Goat"            
##  [5] "Guinea pig"       "Dipliodocus"      "Asian elephant"   "Donkey"          
##  [9] "Horse"            "Potar monkey"     "Cat"              "Giraffe"         
## [13] "Gorilla"          "Human"            "African elephant" "Triceratops"     
## [17] "Rhesus monkey"    "Kangaroo"         "Golden hamster"   "Mouse"           
## [21] "Rabbit"           "Sheep"            "Jaguar"           "Chimpanzee"      
## [25] "Rat"              "Brachiosaurus"    "Mole"             "Pig"- Let’s start with with some data visualizations. Display the boxplots
showing the distributions of brainandbody.
grid.arrange(
  ggplot(Animals) +
    geom_boxplot(aes(y = brain)),
  ggplot(Animals) +
    geom_boxplot(aes(y = body)), 
  nrow = 1)- Display a scatter plot with bodyon the x-axis andbrainon the y-axis, and add a linear regression line including the confidence band. Interpret the plot.
ggplot(Animals, aes(x = body, y = brain)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()- Let’s see how the linear regression model performs. Fit the linear
model \(m_0\) predicting
brainfrombody; display its summary, and interpret the parameter estimates and the R-squared.
m0 <- lm(formula = brain ~ body, data = Animals)
summary(m0)## 
## Call:
## lm(formula = brain ~ body, data = Animals)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -576.0 -554.1 -438.1 -156.3 5138.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.764e+02  2.659e+02   2.168   0.0395 *
## body        -4.326e-04  1.589e-02  -0.027   0.9785  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1360 on 26 degrees of freedom
## Multiple R-squared:  2.853e-05,  Adjusted R-squared:  -0.03843 
## F-statistic: 0.0007417 on 1 and 26 DF,  p-value: 0.9785- Display the diagnostic residual plots for the previous model, and interpret the results.
par(mfcol = c(2, 2))
plot(m0)1.2 Data transformations
Since the variables brain and body are
substantially skewed, a data transformation may improve the fit of the
model.
- Apply transformations to normalize the distributions of the
variables bodyandbrain, and check the result with boxplots.
grid.arrange(
  ggplot(Animals) +
    geom_boxplot(aes(y = log(brain))),
  ggplot(Animals) +
     geom_boxplot(aes(y = log(body))),
    nrow = 1)- Display a scatter plot of the transformed versions of
brainandbody, and add a linear regression line.
ggplot(Animals, aes(x = log(body), y = log(brain))) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()- Fit the linear model m1with the transformedbodypredicting the transformedbrain; display its summary, and interpret the parameter estimates and the R-squared.
m1 <- lm(formula = log(brain) ~ log(body), data = Animals)
summary(m1)## 
## Call:
## lm(formula = log(brain) ~ log(body), data = Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2890 -0.6763  0.3316  0.8646  2.5835 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.55490    0.41314   6.184 1.53e-06 ***
## log(body)    0.49599    0.07817   6.345 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.532 on 26 degrees of freedom
## Multiple R-squared:  0.6076, Adjusted R-squared:  0.5925 
## F-statistic: 40.26 on 1 and 26 DF,  p-value: 1.017e-06- Display the diagnostic plots for m1and interpret. Are the model assumptions fulfilled?
par(mfcol = c(2, 2))
plot(lm(formula = log(brain) ~ log(body), data = Animals))1.3 Adding a predictor
Additional predictors can help to satisfy the assumptions. For the data at hand, the residuals plots show three outliers that have one thing at common; they are prehistoric species. Let’s see what happens if we add a predictor to the model that distinguishes between extinct and other species.
- Make the factor specieswith levels “extinct” for the prehistoric animals and “other” for the others, and add this factor to the data frameAnimals.
Animals <- mutate(Animals, 
                  species = recode_factor(rownames(Animals),
                                          Brachiosaurus = "extinct",
                                          Triceratops   = "extinct",
                                          Dipliodocus   = "extinct", 
                                          .default      = "other")
                  )OR
Animals <- mutate(Animals, 
                  species = case_when(
                    rownames(Animals) %in% c("Brachiosaurus",
                                             "Triceratops",
                                             "Dipliodocus") ~ "extinct",
                    TRUE ~ "other")
                  )- Fit the model m2by adding the predictorspeciesto modelm1, display its summary, and interpret the parameter estimates and the R-squared.
m2 <- lm(formula = log(brain) ~ log(body) + species, data = Animals)
summary(m2)## 
## Call:
## lm(formula = log(brain) ~ log(body) + species, data = Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9042 -0.4335 -0.1240  0.2410  1.9344 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.05771    0.60118  -5.086 2.98e-05 ***
## log(body)     0.74855    0.04428  16.903 3.43e-15 ***
## speciesother  5.21935    0.53016   9.845 4.39e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7073 on 25 degrees of freedom
## Multiple R-squared:  0.9195, Adjusted R-squared:  0.9131 
## F-statistic: 142.9 on 2 and 25 DF,  p-value: 2.088e-14- Display and interpret the diagnostic plots for m2.
par(mfcol = c(2, 2))
plot(m2)The diagnostic plots show outliers that have a higher (logarithm of)
brain than predicted by the model. These outliers belong to
the class of species know as primates. This result suggests that the
predictor species could be extended with the category
primates.
- Add the level “primate” to the variable species(check which of the animals are primates).
Animals  <- mutate(Animals, species = recode_factor(rownames(Animals),
                                  'Brachiosaurus' = "extinct",
                                  'Triceratops'   = "extinct",
                                  'Dipliodocus'   = "extinct", 
                                  'Human'         = "primate",
                                  'Chimpanzee'    = "primate",
                                  'Gorilla'       = "primate",
                                  'Rhesus monkey' = "primate",
                                  'Potar monkey'  = "primate",
                                  .default        = "other")
                   )OR
Animals <- mutate(Animals, 
                  species = case_when(rownames(Animals) %in% c('Human', 
                                                               'Chimpanzee', 
                                                               'Gorilla',
                                                               'Rhesus monkey', 
                                                               'Potar monkey') ~ "primate", 
                                      TRUE ~ species)
                  )- Fit the model m3with the modified predictorspecies; display its summary, and interpret the parameter estimates and the R-squared.
m3 <- lm(formula = log(brain) ~ log(body) + species, data = Animals)
summary(m3)## 
## Call:
## lm(formula = log(brain) ~ log(body) + species, data = Animals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16459 -0.29307 -0.06216  0.29418  0.90147 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.9303     0.4148  -7.064 2.65e-07 ***
## log(body)        0.7358     0.0306  24.046  < 2e-16 ***
## speciesprimate   6.1776     0.4067  15.191 8.24e-14 ***
## speciesother     4.8689     0.3710  13.124 1.92e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4872 on 24 degrees of freedom
## Multiple R-squared:  0.9633, Adjusted R-squared:  0.9588 
## F-statistic: 210.3 on 3 and 24 DF,  p-value: < 2.2e-16- Display and interpret the diagnostic plots for m3. What do the residual plots say about the brain weight of the Gorilla?
par(mfcol = c(2, 2))
plot(m3)- Test the fit of the models m1tom3with theanova()function.
anova(m1, m2, m3)## Analysis of Variance Table
## 
## Model 1: log(brain) ~ log(body)
## Model 2: log(brain) ~ log(body) + species
## Model 3: log(brain) ~ log(body) + species
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     26 60.988                                   
## 2     25 12.505  1    48.483 204.264 3.096e-13 ***
## 3     24  5.696  1     6.809  28.687 1.692e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1End of practical