Practical G: Solutions

Linear Model

Maarten Cruyff


library(MASS)
library(ggplot2)
library(dplyr)
library(gridExtra)

1 Animals

In this exercise you will work with the Animals data from the package MASS. These data include two variables measuring the brain and body weight of various animals. The central question in this exercise is: “Can we predict the weight of an animal’s brain from the weight of the animal’s body? The exercise shows that with some simple data transformations and adding a predictor we can go from 0% variance explained to almost 100%!

1.1 Linear model

  1. Check the help file for Animals, and display the row names to get an idea of what kinds of animals are in the data.
rownames(Animals)
##  [1] "Mountain beaver"  "Cow"              "Grey wolf"        "Goat"            
##  [5] "Guinea pig"       "Dipliodocus"      "Asian elephant"   "Donkey"          
##  [9] "Horse"            "Potar monkey"     "Cat"              "Giraffe"         
## [13] "Gorilla"          "Human"            "African elephant" "Triceratops"     
## [17] "Rhesus monkey"    "Kangaroo"         "Golden hamster"   "Mouse"           
## [21] "Rabbit"           "Sheep"            "Jaguar"           "Chimpanzee"      
## [25] "Rat"              "Brachiosaurus"    "Mole"             "Pig"
  1. Let’s start with with some data visualizations. Display the boxplots showing the distributions of brain and body.
grid.arrange(
  ggplot(Animals) +
    geom_boxplot(aes(y = brain)),
  ggplot(Animals) +
    geom_boxplot(aes(y = body)), 
  nrow = 1)

  1. Display a scatter plot with body on the x-axis and brain on the y-axis, and add a linear regression line including the confidence band. Interpret the plot.
ggplot(Animals, aes(x = body, y = brain)) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()

  1. Let’s see how the linear regression model performs. Fit the linear model \(m_0\) predicting brain from body; display its summary, and interpret the parameter estimates and the R-squared.
m0 <- lm(formula = brain ~ body, data = Animals)
summary(m0)
## 
## Call:
## lm(formula = brain ~ body, data = Animals)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -576.0 -554.1 -438.1 -156.3 5138.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  5.764e+02  2.659e+02   2.168   0.0395 *
## body        -4.326e-04  1.589e-02  -0.027   0.9785  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1360 on 26 degrees of freedom
## Multiple R-squared:  2.853e-05,  Adjusted R-squared:  -0.03843 
## F-statistic: 0.0007417 on 1 and 26 DF,  p-value: 0.9785
  1. Display the diagnostic residual plots for the previous model, and interpret the results.
par(mfcol = c(2, 2))
plot(m0)

1.2 Data transformations

Since the variables brain and body are substantially skewed, a data transformation may improve the fit of the model.

  1. Apply transformations to normalize the distributions of the variables body and brain, and check the result with boxplots.
grid.arrange(
  ggplot(Animals) +
    geom_boxplot(aes(y = log(brain))),
  ggplot(Animals) +
     geom_boxplot(aes(y = log(body))),
    nrow = 1)

  1. Display a scatter plot of the transformed versions of brain and body, and add a linear regression line.
ggplot(Animals, aes(x = log(body), y = log(brain))) +
  geom_point() + 
  geom_smooth(method = "lm") +
  theme_minimal()

  1. Fit the linear model m1 with the transformed body predicting the transformed brain; display its summary, and interpret the parameter estimates and the R-squared.
m1 <- lm(formula = log(brain) ~ log(body), data = Animals)
summary(m1)
## 
## Call:
## lm(formula = log(brain) ~ log(body), data = Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2890 -0.6763  0.3316  0.8646  2.5835 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.55490    0.41314   6.184 1.53e-06 ***
## log(body)    0.49599    0.07817   6.345 1.02e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.532 on 26 degrees of freedom
## Multiple R-squared:  0.6076, Adjusted R-squared:  0.5925 
## F-statistic: 40.26 on 1 and 26 DF,  p-value: 1.017e-06
  1. Display the diagnostic plots for m1 and interpret. Are the model assumptions fulfilled?
par(mfcol = c(2, 2))
plot(lm(formula = log(brain) ~ log(body), data = Animals))

1.3 Adding a predictor

Additional predictors can help to satisfy the assumptions. For the data at hand, the residuals plots show three outliers that have one thing at common; they are prehistoric species. Let’s see what happens if we add a predictor to the model that distinguishes between extinct and other species.

  1. Make the factor species with levels “extinct” for the prehistoric animals and “other” for the others, and add this factor to the data frame Animals.
Animals <- mutate(Animals, 
                  species = recode_factor(rownames(Animals),
                                          Brachiosaurus = "extinct",
                                          Triceratops   = "extinct",
                                          Dipliodocus   = "extinct", 
                                          .default      = "other")
                  )

OR

Animals <- mutate(Animals, 
                  species = case_when(
                    rownames(Animals) %in% c("Brachiosaurus",
                                             "Triceratops",
                                             "Dipliodocus") ~ "extinct",
                    TRUE ~ "other")
                  )
  1. Fit the model m2 by adding the predictor species to model m1, display its summary, and interpret the parameter estimates and the R-squared.
m2 <- lm(formula = log(brain) ~ log(body) + species, data = Animals)
summary(m2)
## 
## Call:
## lm(formula = log(brain) ~ log(body) + species, data = Animals)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.9042 -0.4335 -0.1240  0.2410  1.9344 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.05771    0.60118  -5.086 2.98e-05 ***
## log(body)     0.74855    0.04428  16.903 3.43e-15 ***
## speciesother  5.21935    0.53016   9.845 4.39e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7073 on 25 degrees of freedom
## Multiple R-squared:  0.9195, Adjusted R-squared:  0.9131 
## F-statistic: 142.9 on 2 and 25 DF,  p-value: 2.088e-14
  1. Display and interpret the diagnostic plots for m2.
par(mfcol = c(2, 2))
plot(m2)

The diagnostic plots show outliers that have a higher (logarithm of) brain than predicted by the model. These outliers belong to the class of species know as primates. This result suggests that the predictor species could be extended with the category primates.

  1. Add the level “primate” to the variable species (check which of the animals are primates).
Animals  <- mutate(Animals, species = recode_factor(rownames(Animals),
                                  'Brachiosaurus' = "extinct",
                                  'Triceratops'   = "extinct",
                                  'Dipliodocus'   = "extinct", 
                                  'Human'         = "primate",
                                  'Chimpanzee'    = "primate",
                                  'Gorilla'       = "primate",
                                  'Rhesus monkey' = "primate",
                                  'Potar monkey'  = "primate",
                                  .default        = "other")
                   )

OR

Animals <- mutate(Animals, 
                  species = case_when(rownames(Animals) %in% c('Human', 
                                                               'Chimpanzee', 
                                                               'Gorilla',
                                                               'Rhesus monkey', 
                                                               'Potar monkey') ~ "primate", 
                                      TRUE ~ species)
                  )
  1. Fit the model m3 with the modified predictor species; display its summary, and interpret the parameter estimates and the R-squared.
m3 <- lm(formula = log(brain) ~ log(body) + species, data = Animals)
summary(m3)
## 
## Call:
## lm(formula = log(brain) ~ log(body) + species, data = Animals)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16459 -0.29307 -0.06216  0.29418  0.90147 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.9303     0.4148  -7.064 2.65e-07 ***
## log(body)        0.7358     0.0306  24.046  < 2e-16 ***
## speciesprimate   6.1776     0.4067  15.191 8.24e-14 ***
## speciesother     4.8689     0.3710  13.124 1.92e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4872 on 24 degrees of freedom
## Multiple R-squared:  0.9633, Adjusted R-squared:  0.9588 
## F-statistic: 210.3 on 3 and 24 DF,  p-value: < 2.2e-16
  1. Display and interpret the diagnostic plots for m3. What do the residual plots say about the brain weight of the Gorilla?
par(mfcol = c(2, 2))
plot(m3)

  1. Test the fit of the models m1 to m3 with the anova() function.
anova(m1, m2, m3)
## Analysis of Variance Table
## 
## Model 1: log(brain) ~ log(body)
## Model 2: log(brain) ~ log(body) + species
## Model 3: log(brain) ~ log(body) + species
##   Res.Df    RSS Df Sum of Sq       F    Pr(>F)    
## 1     26 60.988                                   
## 2     25 12.505  1    48.483 204.264 3.096e-13 ***
## 3     24  5.696  1     6.809  28.687 1.692e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

End of practical