Practical K Solutions

The data sets cars contain the two variables speed and distance. In this practical we use 5-fold cross validation to check whether the stopping distance of a car is linearly, qudraticly or cubicly related to the speed of the car. We follow the same steps as in the lecture sheets.

Data exploration

Display a plot of the data.

plot(cars)

Display the summary of a cubic model predicting dist from speed. What do you conclude from this model?

summary(lm(dist ~ poly(speed, 3), cars))

## 
## Call:
## lm(formula = dist ~ poly(speed, 3), data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -26.670  -9.601  -2.231   7.075  44.691 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        42.98       2.15  19.988  < 2e-16 ***
## poly(speed, 3)1   145.55      15.21   9.573  1.6e-12 ***
## poly(speed, 3)2    23.00      15.21   1.512    0.137    
## poly(speed, 3)3    13.80      15.21   0.907    0.369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.2 on 46 degrees of freedom
## Multiple R-squared:  0.6732, Adjusted R-squared:  0.6519 
## F-statistic: 31.58 on 3 and 46 DF,  p-value: 3.074e-11

If you display the data you will see that the cases are ordered in ascending order of speed. We don’t want that we are going to make the folds. To mixed them up a little bit run the following line of code:

set.seed(2)
cars <- cars[sample(1:nrow(cars)), ]

Initializing variables

Start with making the variable folds for 5-fold cross validation.

folds <- 5

Make the list test_id by sampling of the indices for the test sets. Display the list to check its content.

test_id <- split(1:nrow(cars), 1:5)
test_id

## $`1`
##  [1]  1  6 11 16 21 26 31 36 41 46
## 
## $`2`
##  [1]  2  7 12 17 22 27 32 37 42 47
## 
## $`3`
##  [1]  3  8 13 18 23 28 33 38 43 48
## 
## $`4`
##  [1]  4  9 14 19 24 29 34 39 44 49
## 
## $`5`
##  [1]  5 10 15 20 25 30 35 40 45 50

Get the MSE for the first fold of the linear model.

Make the matrix mse_lin with folds rows to store the test MSE’s of the linear model.

mse_lin <- matrix(nrow = folds)

Fit the linear model to the training set cars[-test_id[[1]], ] of the first fold (beware to use the correct number of square brackets!). Save the object as fit.

fit <- lm(dist ~ speed, cars[-test_id[[1]], ])

Get the predictions for the test set and save then as pred. Display the result.

pred <- predict(fit, newdata = cars[test_id[[1]], ])

pred

##       21        8       11       45       34       35       38       24 
## 36.78690 20.97032 24.92447 72.37421 52.60349 52.60349 56.55763 40.74105 
##        5       42 
## 13.06203 60.51178

Compute the test MS and store it in the 1st row of pred.

mse_lin[1] <- mean((cars$dist[test_id[[1]]] - pred)^2)

Display pred. It should be a matrix with 5 rows with the test MSE of the 1st fold in the first row.

mse_lin

##          [,1]
## [1,] 249.6232
## [2,]       NA
## [3,]       NA
## [4,]       NA
## [5,]       NA

Programming the loops for the linear, quadratic and cubic models

Write the for loop for cross validation of the linear model. Follow the example in the lecture sheets (don’t forget to change the names of the data set and the variables!). Save the test MSE’s in mse_lin. Run the loop and display the content of mse_lin.

for (i in 1:folds) {
  
  fit        <- lm(dist ~ speed, data = cars[-test_id[[i]], ])      
  
  pred       <- predict(fit, newdata = cars[test_id[[i]], ]) 
  
  mse_lin[i] <- mean((cars$dist[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_lin

##          [,1]
## [1,] 249.6232
## [2,] 340.9066
## [3,] 228.3121
## [4,] 138.8675
## [5,] 275.8684

Now do the same for the quadratic model. Don’t forget to make the matrix mse_quad to store the test MSE’s before you run the loop.

mse_quad <- matrix(nrow = folds)

for (i in 1:folds) {
  
  fit         <- lm(dist ~ poly(speed, 2), data = cars[-test_id[[i]], ])      
  
  pred        <- predict(fit, newdata = cars[test_id[[i]], ]) 
  
  mse_quad[i] <- mean((cars$dist[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_quad

##          [,1]
## [1,] 290.7010
## [2,] 319.7124
## [3,] 235.1947
## [4,] 108.8191
## [5,] 290.2651

And for the cubic model.

mse_cub <- matrix(nrow = folds)

for (i in 1:folds) {
  
  fit         <- lm(dist ~ poly(speed, 3), data = cars[-test_id[[i]], ])      
  
  pred        <- predict(fit, newdata = cars[test_id[[i]], ]) 
  
  mse_cub[i]  <- mean((cars$dist[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_cub

##          [,1]
## [1,] 348.5955
## [2,] 324.2862
## [3,] 224.7608
## [4,] 109.5376
## [5,] 288.7389

Model selection

Display a data frame with the averaged MSE’s for the linear, quadratric and cubic models.

data.frame(linear    = mean(mse_lin), 
           quadratic = mean(mse_quad), 
           cubic     = mean(mse_cub),
           row.names = "Average test MSE")

##                    linear quadratic    cubic
## Average test MSE 246.7156  248.9384 259.1838

What is your conclusion regarding the relationship between stopping distance and speed of a car?
Run the code again using a different seed than 2 in exercise 1c. Do you reach the same conclusion?

End of Practical K.

Practical K Solutions

Maarten Cruyff

Statistical Programming in R