The data sets cars
contain the two variables
speed
and distance
. In this practical we use
5-fold cross validation to check whether the stopping distance of a car
is linearly, qudraticly or cubicly related to the speed of the car. We
follow the same steps as in the lecture sheets.
- Data exploration
- Display a plot of the data.
plot(cars)
- Display the summary of a cubic model predicting
dist
fromspeed
. What do you conclude from this model?
summary(lm(dist ~ poly(speed, 3), cars))
##
## Call:
## lm(formula = dist ~ poly(speed, 3), data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.670 -9.601 -2.231 7.075 44.691
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.98 2.15 19.988 < 2e-16 ***
## poly(speed, 3)1 145.55 15.21 9.573 1.6e-12 ***
## poly(speed, 3)2 23.00 15.21 1.512 0.137
## poly(speed, 3)3 13.80 15.21 0.907 0.369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.2 on 46 degrees of freedom
## Multiple R-squared: 0.6732, Adjusted R-squared: 0.6519
## F-statistic: 31.58 on 3 and 46 DF, p-value: 3.074e-11
- If you display the data you will see that the cases are ordered in
ascending order of
speed
. We don’t want that we are going to make the folds. To mixed them up a little bit run the following line of code:
set.seed(2)
<- cars[sample(1:nrow(cars)), ] cars
- Initializing variables
- Start with making the variable
folds
for 5-fold cross validation.
<- 5 folds
- Make the list
test_id
by sampling of the indices for the test sets. Display the list to check its content.
<- split(1:nrow(cars), 1:5)
test_id test_id
## $`1`
## [1] 1 6 11 16 21 26 31 36 41 46
##
## $`2`
## [1] 2 7 12 17 22 27 32 37 42 47
##
## $`3`
## [1] 3 8 13 18 23 28 33 38 43 48
##
## $`4`
## [1] 4 9 14 19 24 29 34 39 44 49
##
## $`5`
## [1] 5 10 15 20 25 30 35 40 45 50
- Get the MSE for the first fold of the linear model.
- Make the matrix
mse_lin
withfolds
rows to store the test MSE’s of the linear model.
<- matrix(nrow = folds) mse_lin
- Fit the linear model to the training set
cars[-test_id[[1]], ]
of the first fold (beware to use the correct number of square brackets!). Save the object asfit
.
<- lm(dist ~ speed, cars[-test_id[[1]], ]) fit
- Get the predictions for the test set and save then as
pred
. Display the result.
<- predict(fit, newdata = cars[test_id[[1]], ])
pred
pred
## 21 8 11 45 34 35 38 24
## 36.78690 20.97032 24.92447 72.37421 52.60349 52.60349 56.55763 40.74105
## 5 42
## 13.06203 60.51178
- Compute the test MS and store it in the 1st row of
pred
.
1] <- mean((cars$dist[test_id[[1]]] - pred)^2) mse_lin[
- Display
pred
. It should be a matrix with 5 rows with the test MSE of the 1st fold in the first row.
mse_lin
## [,1]
## [1,] 249.6232
## [2,] NA
## [3,] NA
## [4,] NA
## [5,] NA
- Programming the loops for the linear, quadratic and cubic models
- Write the
for
loop for cross validation of the linear model. Follow the example in the lecture sheets (don’t forget to change the names of the data set and the variables!). Save the test MSE’s inmse_lin
. Run the loop and display the content ofmse_lin
.
for (i in 1:folds) {
<- lm(dist ~ speed, data = cars[-test_id[[i]], ])
fit
<- predict(fit, newdata = cars[test_id[[i]], ])
pred
<- mean((cars$dist[test_id[[i]]] - pred)^2) # test MSE of the i-th fold
mse_lin[i]
}
mse_lin
## [,1]
## [1,] 249.6232
## [2,] 340.9066
## [3,] 228.3121
## [4,] 138.8675
## [5,] 275.8684
- Now do the same for the quadratic model. Don’t forget to make the
matrix
mse_quad
to store the test MSE’s before you run the loop.
<- matrix(nrow = folds)
mse_quad
for (i in 1:folds) {
<- lm(dist ~ poly(speed, 2), data = cars[-test_id[[i]], ])
fit
<- predict(fit, newdata = cars[test_id[[i]], ])
pred
<- mean((cars$dist[test_id[[i]]] - pred)^2) # test MSE of the i-th fold
mse_quad[i]
}
mse_quad
## [,1]
## [1,] 290.7010
## [2,] 319.7124
## [3,] 235.1947
## [4,] 108.8191
## [5,] 290.2651
- And for the cubic model.
<- matrix(nrow = folds)
mse_cub
for (i in 1:folds) {
<- lm(dist ~ poly(speed, 3), data = cars[-test_id[[i]], ])
fit
<- predict(fit, newdata = cars[test_id[[i]], ])
pred
<- mean((cars$dist[test_id[[i]]] - pred)^2) # test MSE of the i-th fold
mse_cub[i]
}
mse_cub
## [,1]
## [1,] 348.5955
## [2,] 324.2862
## [3,] 224.7608
## [4,] 109.5376
## [5,] 288.7389
- Model selection
- Display a data frame with the averaged MSE’s for the linear, quadratric and cubic models.
data.frame(linear = mean(mse_lin),
quadratic = mean(mse_quad),
cubic = mean(mse_cub),
row.names = "Average test MSE")
## linear quadratic cubic
## Average test MSE 246.7156 248.9384 259.1838
What is your conclusion regarding the relationship between stopping distance and speed of a car?
Run the code again using a different seed than 2 in exercise 1c. Do you reach the same conclusion?
End of Practical K
.