Statistical Programming with R

Statistics vs Prediction

Statistics:

  • hypothesis testing

    • significance of parameter estimates


Prediction:

  • make predictions on new data set

    • machine learning (e.g. medical diagnosis)

Prediction

Model training

  • train the model using some algorithm (e.g. linear model)

  • to predict the outcome on new data

Model training

  • distinction between training data and new data

  • aim is to predict outcome in new data set

Statistical models

Mathematical representation of a theory


\[y=f(x)+\epsilon\]


  • \(y\) outcome/output/response/dependent variable

  • \(x\) input/predictors/features

  • \(f(x)\) prediction function

  • \(\epsilon\) (irreducible) prediction error
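
As a sketch, data from such a model can be simulated in R; the linear \(f\) here is just an assumed example:

set.seed(123)
x <- runif(50, 0, 5)           # inputs
f <- function(x) 2 + x         # an assumed prediction function
y <- f(x) + rnorm(50, 0, 2)    # outcome = f(x) + irreducible error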

Competing \(f(x)\)

Many competing models

  • linear models (least squares)

  • nonlinear models

  • tree-based methods

  • support vector machines

  • neural networks

  • etc.


Models differ in complexity and interpretability

  • related to number of parameters in the model

Model complexity and interpretability

Model choice

“All models are wrong, but some are useful” (George Box)


What is the best model choice?

  • no a priori best model

  • depends on the data

  • try out alternatives, and select the best performing


But how to determine which one performs best???

Mean Squared Error

Find the model with minimal MSE:

\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-f(x_i))^2 = \frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2\]

  • mean of the squared differences between observations and predictions
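
In R, the MSE of a fitted model can be computed from its predictions; a minimal sketch, assuming a data frame d with columns x and y:

fit  <- lm(y ~ x, data = d)    # some fitted model
pred <- predict(fit)           # predictions f(x) on the training data
mean((d$y - pred)^2)           # MSE: mean of the squared residuals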

Example

Polynomial regression models

Increasing model complexity with polynomials


  • Linear model (2 parameters)

\[f(x)=\beta_0+\beta_1{x}\]

  • Quadratic model (3 parameters)

\[f(x)=\beta_0+\beta_1{x}+\beta_2x^2\]

  • Cubic model (4 parameters)

\[f(x)=\beta_0+\beta_1{x}+\beta_2x^2+\beta_3x^3\]

etc.
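
In R such models can be specified in lm() formulas; a sketch, again assuming a data frame d with columns x and y:

fit_lin  <- lm(y ~ x, data = d)                       # linear model
fit_quad <- lm(y ~ x + I(x^2), data = d)              # quadratic model
fit_cub  <- lm(y ~ poly(x, 3, raw = TRUE), data = d)  # cubic model, via poly()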

Data points generated from \(f(x)\)

What is \(f(x)\)? Is it . . .

  • linear, quadratic, cubic, etc.?

Strategy for finding \(f(x)\)

Fit all polynomial models to the data

  • compute the MSE for each model

  • select the model with the smallest MSE
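
A minimal sketch of this strategy as a loop over polynomial degrees (assuming the data frame d):

mse <- sapply(1:4, function(k) {
  fit <- lm(y ~ poly(x, k), data = d)
  mean(residuals(fit)^2)       # training MSE of the degree-k model
})
which.min(mse)                 # degree with the smallest training MSE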


The model with the smallest MSE most resembles the true \(f(x)\).


TRUE?

Predicted values for \(y\)

Regression lines for the linear, quadratic, cubic, and quartic models

  • the quartic model fits the data perfectly

MSE

Comparison of the MSE of the fitted models

Fitted models   MSE      \(R^2\)
linear          2.3584   0.088
quadratic       2.1713   0.16
cubic           0.3526   0.864
quartic         0        1


So the quartic model is the best????

Five new data points from \(f(x)\)

Regression lines fitted to the original five data points

  • How well do the models fit to these new five data points?

MSE for new data points

The MSE on the new data points shows a different picture!


Fitted models        MSE
linear               8.7
quadratic            7.1
cubic              658
quartic          11986.5


  • quadratic model performs best

  • quartic model performs terribly

True \(f(x)\)

Data were generated from a linear model:


\[y = f(x)+\epsilon = 2+x+\epsilon, \ \ \ \ \ \epsilon\sim N(0,4)\]


Why did we not find the correct model with the training MSE?

  • simple models are biased (too few predictors)

  • complex models have high variance (too many predictors)

Bias vs variance

The expected MSE is composed of bias, variance and irreducible error

\[E(MSE) = \text{bias}^2 + \text{variance} + \sigma^2\]
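
A minimal simulation sketch of this decomposition at a single point \(x_0\), assuming the true model revealed above (\(f(x)=2+x\), reading \(N(0,4)\) as variance 4):

set.seed(1)
f    <- function(x) 2 + x                  # the true f(x) from the slide above
x0   <- 3                                  # point at which we evaluate E(MSE)
pred <- replicate(2000, {                  # repeatedly redraw a training set
  x <- runif(20, 0, 5)
  y <- f(x) + rnorm(20, sd = 2)            # epsilon ~ N(0, 4), i.e. sd = 2
  fit <- lm(y ~ poly(x, 4))                # a flexible (quartic) model
  predict(fit, newdata = data.frame(x = x0))
})
(mean(pred) - f(x0))^2 + var(pred) + 4     # bias^2 + variance + sigma^2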


MSE vs model complexity

Cross validation

Method to find optimal model complexity

  • Split the data into folds (training and test sets)
    • fit the model to the training sets
    • compute the MSE on the test sets
    • select the model with the lowest test MSE

5-fold cross validation on example data

We will program it ourselves

  • this is the original example data set
d
##          x        y
## 1 1.796526 6.341385
## 2 2.116372 4.945655
## 3 2.718560 1.638660
## 4 3.724623 3.867489
## 5 1.605046 3.015605

Folds and training and test sets

Getting the indices for the test sets

  • with 5-fold cross validation we need 5 test sets

Use split() to make the test sets

  • split the row indices 1 to \(n\) into 5 groups (here \(n=5\); with larger data sets you would first shuffle the indices with sample(), as sketched below)
folds   <- 5
row_id  <- 1:nrow(d)                   # row indices 1 to n (here n = 5)
test_id <- split(x = row_id, f = 1:folds)  # f assigns each row to fold 1 to 5
test_id                                # test_id is a list
## $`1`
## [1] 1
## 
## $`2`
## [1] 2
## 
## $`3`
## [1] 3
## 
## $`4`
## [1] 4
## 
## $`5`
## [1] 5
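
With larger data sets you would first shuffle the row indices with sample(); a minimal sketch, with a hypothetical \(n\):

n        <- 100                             # hypothetical larger data set
shuffled <- sample(1:n)                     # random permutation of the row indices
test_id  <- split(shuffled, f = rep(1:folds, length.out = n))  # 5 roughly equal folds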

Training and test sets

The test set for the 1st fold is

d[test_id[[1]], ]
##          x        y
## 1 1.796526 6.341385

and the training set for the first fold is

d[-test_id[[1]], ]
##          x        y
## 2 2.116372 4.945655
## 3 2.718560 1.638660
## 4 3.724623 3.867489
## 5 1.605046 3.015605

Get the test MSE

  1. train the models on the five training sets
  2. get the predictions on the five test sets
  3. compute and store the test MSEs

Example for the linear model and the first fold

mse_lin    <- matrix(nrow = folds)                    # for storing test MSE's

fit        <- lm(y ~ x, data = d[-test_id[[1]], ])      # fit model to training set

pred       <- predict(fit, newdata = d[test_id[[1]], ]) # get prediction on test set

mse_lin[1] <- mean((d$y[test_id[[1]]] - pred)^2)         # compute test MSE

mse_lin
##          [,1]
## [1,] 8.748547
## [2,]       NA
## [3,]       NA
## [4,]       NA
## [5,]       NA

Now for the 2nd fold

fit        <- lm(y ~ x, data = d[-test_id[[2]], ])      # fit model to training set

pred       <- predict(fit, newdata = d[test_id[[2]], ]) # get prediction on test set

mse_lin[2] <- mean((d$y[test_id[[2]]] - pred)^2)         # compute test MSE

mse_lin
##          [,1]
## [1,] 8.748547
## [2,] 1.100311
## [3,]       NA
## [4,]       NA
## [5,]       NA

Using a for loop

Writing the code for 5 folds is cumbersome

  • we can do it more efficiently with a for loop
mse_lin    <- matrix(nrow = folds)                    

for (i in 1:folds) {     # i takes the value of the consecutive folds
  
  fit        <- lm(y ~ x, data = d[-test_id[[i]], ])      
  
  pred       <- predict(fit, newdata = d[test_id[[i]], ]) 
  
  mse_lin[i] <- mean((d$y[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_lin
##           [,1]
## [1,]  8.748547
## [2,]  1.100311
## [3,]  7.703967
## [4,] 14.315792
## [5,]  5.957239

For the quadratic model

mse_quad    <- matrix(nrow = folds)                    

for (i in 1:folds) {     # i takes the value of the consecutive folds
  
  fit         <- lm(y ~ poly(x, 2), data = d[-test_id[[i]], ])     # poly(x, 2) fits terms in x and x^2
  
  pred        <- predict(fit, newdata = d[test_id[[i]], ]) 
  
  mse_quad[i] <- mean((d$y[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_quad
##            [,1]
## [1,]   8.133892
## [2,]   3.153669
## [3,]  21.568271
## [4,] 629.114316
## [5,]  29.080438

For the cubic model

mse_cub   <- matrix(nrow = folds)                    

for (i in 1:folds) {     
  
  fit        <- lm(y ~ poly(x, 3), data = d[-test_id[[i]], ])     
  
  pred       <- predict(fit, newdata = d[test_id[[i]], ]) 
  
  mse_cub[i] <- mean((d$y[test_id[[i]]] - pred)^2)     
  
}

mse_cub
##              [,1]
## [1,]     3.041523
## [2,]     6.436648
## [3,]    99.268457
## [4,] 11218.260959
## [5,]    13.700736

For the quartic model

We can’t fit the quartic model

  • only 4 data points are left in each training set

  • the quartic model has 5 parameters, so at least 5 data points are required


With larger data sets this would not be a problem

Comparing the test MSE’s

data.frame(linear    = mean(mse_lin), 
           quadratic = mean(mse_quad), 
           cubic     = mean(mse_cub),
           row.names = "Average test MSE")
##                    linear quadratic    cubic
## Average test MSE 7.565171  138.2101 2268.142

So with cross validation we find the correct model

  • select the linear model for prediction on a new data set
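
As a sketch, the three fold loops above could be collapsed into one helper function (the name cv_mse is ours):

cv_mse <- function(k) {                  # average test MSE for polynomial degree k
  mse <- numeric(folds)
  for (i in 1:folds) {
    fit    <- lm(y ~ poly(x, k), data = d[-test_id[[i]], ])
    pred   <- predict(fit, newdata = d[test_id[[i]], ])
    mse[i] <- mean((d$y[test_id[[i]]] - pred)^2)
  }
  mean(mse)
}
sapply(1:3, cv_mse)                      # linear, quadratic, cubic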

Practical

Perform cross validation on the cars data set

  • Is the stopping distance of a car a linear, quadratic or cubic function of the speed?
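
A possible starting point (a sketch, not a full solution):

folds   <- 5
n       <- nrow(cars)                    # 50 observations
test_id <- split(sample(1:n), f = rep(1:folds, length.out = n))
# then, for k = 1, 2, 3, fit lm(dist ~ poly(speed, k), ...) in a CV loop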