Statistical Programming with R

Statistics vs Prediction

Statistics:

  • hypothesis testing

    • significance of parameter estimates


Prediction:

  • make predictions on new data set

    • machine learning (e.g. medical diagnosis)

Prediction

Model training

  • train the model using some algorithm (e.g. linear model)

  • to predict the outcome on new data

Model training

  • distinction between training data and new data

  • aim is to predict outcome in new data set

Statistical models

Mathematical representation of a theory


\[y=f(x)+\epsilon\]


  • \(y\) outcome/output/response/dependent variable

  • \(x\) input/predictors/features

  • \(f(x)\) prediction function

  • \(\epsilon\) (irreducible) prediction error
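
As a sketch, data from such a model can be simulated in R; the linear \(f\) here is just an assumed example:

set.seed(123)
x <- runif(50, 0, 5)           # inputs
f <- function(x) 2 + x         # an assumed prediction function
y <- f(x) + rnorm(50, 0, 2)    # outcome = f(x) + irreducible error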

Competing \(f(x)\)

Many competing models

  • linear models (least squares)

  • nonlinear models

  • tree-based methods

  • support vector machines

  • neural networks

  • etc.


Models differ in complexity and interpretability

  • related to number of parameters in the model

Model complexity and interpretability

Model choice

“All models are wrong, but some are useful” (George Box)


What is the best model choice?

  • no a priori best model

  • depends on the data

  • try out alternatives, and select the best performing


But how to determine which one performs best???

Mean Squared Error

Find the model with minimal MSE:

\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i-f(x_i))^2 = \frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2\]

  • mean of the squared differences between observations and predictions
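
In R, the MSE of a fitted model can be computed from its predictions; a minimal sketch, assuming a data frame d with columns x and y:

fit  <- lm(y ~ x, data = d)    # some fitted model
pred <- predict(fit)           # predictions f(x) on the training data
mean((d$y - pred)^2)           # MSE: mean of the squared residuals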

Example

Polynomial regression models

Increasing model complexity with polynomials


  • Linear model (2 parameters)

\[f(x)=\beta_0+\beta_1{x}\]

  • Quadratic model (3 parameters)

\[f(x)=\beta_0+\beta_1{x}+\beta_2x^2\]

  • Cubic model (4 parameters)

\[f(x)=\beta_0+\beta_1{x}+\beta_2x^2+\beta_3x^3\]

etc.
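
In R such models can be specified in lm() formulas; a sketch, again assuming a data frame d with columns x and y:

fit_lin  <- lm(y ~ x, data = d)                       # linear model
fit_quad <- lm(y ~ x + I(x^2), data = d)              # quadratic model
fit_cub  <- lm(y ~ poly(x, 3, raw = TRUE), data = d)  # cubic model, via poly()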

Data points generated from \(f(x)\)

What is \(f(x)\)? Is it . . .

  • linear, quadratic, cubic, etc.?

Strategy for finding \(f(x)\)

Fit all polynomial models to the data

  • compute the MSE for each model

  • select the model with the smallest MSE
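
A minimal sketch of this strategy as a loop over polynomial degrees (assuming the data frame d):

mse <- sapply(1:4, function(k) {
  fit <- lm(y ~ poly(x, k), data = d)
  mean(residuals(fit)^2)       # training MSE of the degree-k model
})
which.min(mse)                 # degree with the smallest training MSE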


The model with the smallest MSE most resembles the true \(f(x)\).


TRUE?

Predicted values for \(y\)

Regression lines for the linear, quadratic, cubic, and quartic models

  • the quartic model fits the data perfectly

MSE

Comparison of the MSE of the fitted models

Fitted models   MSE      \(R^2\)
linear          2.3584   0.088
quadratic       2.1713   0.16
cubic           0.3526   0.864
quartic         0        1


So the quartic model is the best????

Five new data points from \(f(x)\)

Regression lines fitted to the original five data points

  • How well do the models fit to these new five data points?

MSE for new data points

The MSE on the new data points shows a different picture!


Fitted models        MSE
linear               8.7
quadratic            7.1
cubic              658
quartic          11986.5


  • quadratic model performs best

  • quartic model performs terribly

True \(f(x)\)

Data were generated from a linear model:


\[y = f(x)+\epsilon = 2+x+\epsilon, \ \ \ \ \ \epsilon\sim N(0,4)\]


Why did we not find the correct model with the training MSE?

  • simple models are biased (too few predictors)

  • complex models have high variance (too many predictors)

Bias vs variance

The expected MSE is composed of bias, variance and irreducible error

\[E(MSE) = \text{bias}^2 + \text{variance} + \sigma^2\]
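
A minimal simulation sketch of this decomposition at a single point \(x_0\), assuming the true model revealed above (\(f(x)=2+x\), reading \(N(0,4)\) as variance 4):

set.seed(1)
f    <- function(x) 2 + x                  # the true f(x) from the slide above
x0   <- 3                                  # point at which we evaluate E(MSE)
pred <- replicate(2000, {                  # repeatedly redraw a training set
  x <- runif(20, 0, 5)
  y <- f(x) + rnorm(20, sd = 2)            # epsilon ~ N(0, 4), i.e. sd = 2
  fit <- lm(y ~ poly(x, 4))                # a flexible (quartic) model
  predict(fit, newdata = data.frame(x = x0))
})
(mean(pred) - f(x0))^2 + var(pred) + 4     # bias^2 + variance + sigma^2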


MSE vs model complexity

Cross validation

Method to find optimal model complexity

  • Split the data into folds (training and test sets)
    • fit the model to the training sets
    • compute the MSE on the test sets
    • select the model with the lowest test MSE

5-fold cross validation on example data

We will program it ourselves

  • this is the original example data set
d
##          x        y
## 1 1.796526 6.341385
## 2 2.116372 4.945655
## 3 2.718560 1.638660
## 4 3.724623 3.867489
## 5 1.605046 3.015605

Folds and training and test sets

Getting the indices for the test sets

  • with 5-fold cross validation we need 5 test sets

Use split() to make the test sets

  • split the row indices 1 to \(n\) into 5 groups (here \(n=5\); with larger data sets you would first shuffle the indices with sample(), as sketched below)
folds   <- 5
row_id  <- 1:nrow(d)                   # row indices 1 to n (here n = 5)
test_id <- split(x = row_id, f = 1:folds)  # f assigns each row to fold 1 to 5
test_id                                # test_id is a list
## $`1`
## [1] 1
## 
## $`2`
## [1] 2
## 
## $`3`
## [1] 3
## 
## $`4`
## [1] 4
## 
## $`5`
## [1] 5
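
With larger data sets you would first shuffle the row indices with sample(); a minimal sketch, with a hypothetical \(n\):

n        <- 100                             # hypothetical larger data set
shuffled <- sample(1:n)                     # random permutation of the row indices
test_id  <- split(shuffled, f = rep(1:folds, length.out = n))  # 5 roughly equal folds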

Training and test sets

The test set for the 1st fold is

d[test_id[[1]], ]
##          x        y
## 1 1.796526 6.341385

and the training set for the first fold is

d[-test_id[[1]], ]
##          x        y
## 2 2.116372 4.945655
## 3 2.718560 1.638660
## 4 3.724623 3.867489
## 5 1.605046 3.015605

Get the test MSE

  1. train the models on the five training sets
  2. get the predictions on the five test sets
  3. compute and store the test MSEs

Example for the linear model and the first fold

mse_lin    <- matrix(nrow = folds)                    # for storing test MSE's

fit        <- lm(y ~ x, data = d[-test_id[[1]], ])      # fit model to training set

pred       <- predict(fit, newdata = d[test_id[[1]], ]) # get prediction on test set

mse_lin[1] <- mean((d$y[test_id[[1]]] - pred)^2)         # compute test MSE

mse_lin
##          [,1]
## [1,] 8.748547
## [2,]       NA
## [3,]       NA
## [4,]       NA
## [5,]       NA

Now for the 2nd fold

fit        <- lm(y ~ x, data = d[-test_id[[2]], ])      # fit model to training set

pred       <- predict(fit, newdata = d[test_id[[2]], ]) # get prediction on test set

mse_lin[2] <- mean((d$y[test_id[[2]]] - pred)^2)         # compute test MSE

mse_lin
##          [,1]
## [1,] 8.748547
## [2,] 1.100311
## [3,]       NA
## [4,]       NA
## [5,]       NA

Using a for loop

Writing the code for 5 folds is cumbersome

  • we can do it more efficiently with a for loop
mse_lin    <- matrix(nrow = folds)                    

for (i in 1:folds) {     # i takes the value of the consecutive folds
  
  fit        <- lm(y ~ x, data = d[-test_id[[i]], ])      
  
  pred       <- predict(fit, newdata = d[test_id[[i]], ]) 
  
  mse_lin[i] <- mean((d$y[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_lin
##           [,1]
## [1,]  8.748547
## [2,]  1.100311
## [3,]  7.703967
## [4,] 14.315792
## [5,]  5.957239

For the quadratic model

mse_quad    <- matrix(nrow = folds)                    

for (i in 1:folds) {     # i takes the value of the consecutive folds
  
  fit         <- lm(y ~ poly(x, 2), data = d[-test_id[[i]], ])     # poly(x, 2) fits terms in x and x^2
  
  pred        <- predict(fit, newdata = d[test_id[[i]], ]) 
  
  mse_quad[i] <- mean((d$y[test_id[[i]]] - pred)^2)     # test MSE of the i-th fold
  
}

mse_quad
##            [,1]
## [1,]   8.133892
## [2,]   3.153669
## [3,]  21.568271
## [4,] 629.114316
## [5,]  29.080438

For the cubic model

mse_cub   <- matrix(nrow = folds)                    

for (i in 1:folds) {     
  
  fit        <- lm(y ~ poly(x, 3), data = d[-test_id[[i]], ])     
  
  pred       <- predict(fit, newdata = d[test_id[[i]], ]) 
  
  mse_cub[i] <- mean((d$y[test_id[[i]]] - pred)^2)     
  
}

mse_cub
##              [,1]
## [1,]     3.041523
## [2,]     6.436648
## [3,]    99.268457
## [4,] 11218.260959
## [5,]    13.700736

For the quartic model

We can’t fit the quartic model

  • only 4 data points are left in each training set

  • the quartic model has 5 parameters, so at least 5 data points are required


With larger data sets this would not be a problem

Comparing the test MSE’s

data.frame(linear    = mean(mse_lin), 
           quadratic = mean(mse_quad), 
           cubic     = mean(mse_cub),
           row.names = "Average test MSE")
##                    linear quadratic    cubic
## Average test MSE 7.565171  138.2101 2268.142

So with cross validation we find the correct model

  • select the linear model for prediction on a new data set
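
As a sketch, the three fold loops above could be collapsed into one helper function (the name cv_mse is ours):

cv_mse <- function(k) {                  # average test MSE for polynomial degree k
  mse <- numeric(folds)
  for (i in 1:folds) {
    fit    <- lm(y ~ poly(x, k), data = d[-test_id[[i]], ])
    pred   <- predict(fit, newdata = d[test_id[[i]], ])
    mse[i] <- mean((d$y[test_id[[i]]] - pred)^2)
  }
  mean(mse)
}
sapply(1:3, cv_mse)                      # linear, quadratic, cubic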

Practical

Perform cross validation on the cars data set

  • Is the stopping distance of a car a linear, quadratic or cubic function of the speed?
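
A possible starting point (a sketch, not a full solution):

folds   <- 5
n       <- nrow(cars)                    # 50 observations
test_id <- split(sample(1:n), f = rep(1:folds, length.out = n))
# then, for k = 1, 2, 3, fit lm(dist ~ poly(speed, k), ...) in a CV loop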