Statistics:
hypothesis testing
- significance of parameter estimates
Prediction:
make predictions on new data set
- machine learning (e.g. medical diagnosis)
Statistical Programming with R
Model training
train the model using some algorithm (e.g. linear model)
to predict the outcome on new data
distinction between training data and new data
aim is to predict outcome in new data set
Mathematical representation of a theory
\[y=f(x)+\epsilon\]
\(y\) outcome/output/response/dependent variable
\(x\) input/predictors/features
\(f(x)\) prediction function
\(\epsilon\) (irreducible) prediction error
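This representation can be mimicked in R by simulating from a known \(f(x)\); the linear \(f\) and the noise level below are assumptions chosen only for illustration:

```r
set.seed(7)
f   <- function(x) 2 + x   # an assumed true prediction function
x   <- runif(10, 0, 5)     # inputs/predictors
eps <- rnorm(10, sd = 2)   # irreducible error
y   <- f(x) + eps          # observed outcome: y = f(x) + epsilon
```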
Many competing models
linear models (least squares)
nonlinear models
tree-based methods
support vector machines
neural networks
etc.
Models differ in complexity and interpretability
“All models are wrong, but some are useful” (George Box)
What is the best model choice?
no a priori best model
depends on the data
try out alternatives, and select the best performing
But how to determine which one performs best???
Find the model with minimal MSE:
\[MSE = \frac{1}{n}\sum(y-f(x))^2 = \frac{1}{n}\sum\epsilon^2\]
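The formula translates directly into a one-line R function; the vectors below are illustrative values, not the lecture data:

```r
# MSE: the average squared difference between outcome and prediction
mse <- function(y, y_hat) mean((y - y_hat)^2)

y     <- c(1, 2, 3, 4)   # observed outcomes (illustrative)
y_hat <- c(1, 2, 2, 5)   # predictions f(x) (illustrative)
mse(y, y_hat)            # (0 + 0 + 1 + 1) / 4 = 0.5
```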
Increasing model complexity with polynomials
\[f(x)=\beta_0+\beta_1{x}\]
\[f(x)=\beta_0+\beta_1{x}+\beta_2x^2\]
\[f(x)=\beta_0+\beta_1{x}+\beta_2x^2+\beta_3x^3\]
etc.
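In R the increasing polynomial complexity is expressed with `poly()` in the model formula; the data below are simulated, for illustration only:

```r
set.seed(1)
x <- runif(50, 0, 5)             # simulated predictor, for illustration
y <- 2 + x + rnorm(50, sd = 2)   # simulated outcome

fit1 <- lm(y ~ x)                # linear:    beta0 + beta1*x
fit2 <- lm(y ~ poly(x, 2))       # quadratic: adds the x^2 term
fit3 <- lm(y ~ poly(x, 3))       # cubic:     adds the x^3 term
length(coef(fit3))               # 4 coefficients: intercept + 3 polynomial terms
```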
What is \(f(x)\)? Is it…
Fit all polynomial models to the data
compute the MSE for each model
select the model with the smallest MSE
The model with the smallest MSE most resembles the true \(f(x)\).
TRUE?
Regression lines for the linear, quadratic, cubic and quartic models
Comparison of the MSE of the fitted models
Fitted models | MSE | \(R^2\) |
---|---|---|
linear | 2.3584 | 0.088 |
quadratic | 2.1713 | 0.16 |
cubic | 0.3526 | 0.864 |
quartic | 0 | 1 |
So the quartic model is the best????
Regression lines for the original five data points
The MSE of the new data points shows a different picture!
Fitted models | MSE |
---|---|
linear | 8.7 |
quadratic | 7.1 |
cubic | 658 |
quartic | 11986.5 |
the quadratic model performs best
the quartic model performs terribly
Data generated from the linear model:
\[f(x)=2+x+\epsilon, \ \ \ \ \ \epsilon\sim N(0,4)\]
Why did we not find the correct model with the MSE?
simple models are biased (too few predictors)
complex models have high variance (too many predictors)
The expected MSE is composed of bias, variance and irreducible error
\[E(MSE) = bias^2 + variance + \sigma^2\]
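A small simulation sketch can make this decomposition concrete; the setup below (true linear \(f\), \(\sigma = 2\), a deliberately overcomplex cubic fit) is assumed for illustration. It estimates the bias and variance of the prediction at one fixed point \(x_0\):

```r
set.seed(123)
sigma <- 2
x0    <- 2.5                          # fixed point where we evaluate predictions
f     <- function(x) 2 + x            # assumed true f(x)

# repeatedly draw a training set and record the model's prediction at x0
preds <- replicate(1000, {
  x   <- runif(20, 0, 5)
  y   <- f(x) + rnorm(20, sd = sigma)
  fit <- lm(y ~ poly(x, 3))           # a deliberately overcomplex model
  predict(fit, newdata = data.frame(x = x0))
})

bias2    <- (mean(preds) - f(x0))^2   # squared bias of the predictions
variance <- var(preds)                # variance of the predictions
bias2 + variance + sigma^2            # approximates the expected test MSE at x0
```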
Cross-validation: a method to find the optimal model complexity
We will program it ourselves
```r
d
```

```
##          x        y
## 1 1.796526 6.341385
## 2 2.116372 4.945655
## 3 2.718560 1.638660
## 4 3.724623 3.867489
## 5 1.605046 3.015605
```
Getting the indices for the test sets
Use `sample()` to make random test sets

```r
folds   <- 5
row_id  <- 1:5
test_id <- split(x = row_id, f = 1:5)  # f is fold 1 to 5
test_id                                # test_id is a list
```

```
## $`1`
## [1] 1
## 
## $`2`
## [1] 2
## 
## $`3`
## [1] 3
## 
## $`4`
## [1] 4
## 
## $`5`
## [1] 5
```
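With the five toy rows the indices can be split as-is, but for a real data set the row order may be systematic, so the indices should be shuffled with `sample()` first. A sketch, assuming a hypothetical data set of 20 rows:

```r
set.seed(42)
n     <- 20                                # assumed data set of 20 rows
folds <- 5
row_id  <- sample(1:n)                     # random permutation of the row indices
test_id <- split(row_id, rep(1:folds, length.out = n))
sapply(test_id, length)                    # each fold holds n / folds = 4 rows
```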
The test set for the 1st fold is

```r
d[test_id[[1]], ]
```

```
##          x        y
## 1 1.796526 6.341385
```

and the training set for the first fold is

```r
d[-test_id[[1]], ]
```

```
##          x        y
## 2 2.116372 4.945655
## 3 2.718560 1.638660
## 4 3.724623 3.867489
## 5 1.605046 3.015605
```
Example for the linear model and the first fold

```r
mse_lin <- matrix(nrow = folds)                    # for storing test MSE's
fit  <- lm(y ~ x, data = d[-test_id[[1]], ])       # fit model to training set
pred <- predict(fit, newdata = d[test_id[[1]], ])  # get prediction on test set
mse_lin[1] <- mean((d$y[test_id[[1]]] - pred)^2)   # compute test MSE
mse_lin
```

```
##          [,1]
## [1,] 8.748547
## [2,]       NA
## [3,]       NA
## [4,]       NA
## [5,]       NA
```
```r
fit  <- lm(y ~ x, data = d[-test_id[[2]], ])       # fit model to training set
pred <- predict(fit, newdata = d[test_id[[2]], ])  # get prediction on test set
mse_lin[2] <- mean((d$y[test_id[[2]]] - pred)^2)   # compute test MSE
mse_lin
```

```
##          [,1]
## [1,] 8.748547
## [2,] 1.100311
## [3,]       NA
## [4,]       NA
## [5,]       NA
```
Writing the code for 5 folds is cumbersome; use a `for` loop

```r
mse_lin <- matrix(nrow = folds)
for (i in 1:folds) {  # i takes the value of the consecutive folds
  fit  <- lm(y ~ x, data = d[-test_id[[i]], ])
  pred <- predict(fit, newdata = d[test_id[[i]], ])
  mse_lin[i] <- mean((d$y[test_id[[i]]] - pred)^2)  # test MSE of the i-th fold
}
mse_lin
```

```
##           [,1]
## [1,]  8.748547
## [2,]  1.100311
## [3,]  7.703967
## [4,] 14.315792
## [5,]  5.957239
```
```r
mse_quad <- matrix(nrow = folds)
for (i in 1:folds) {  # i takes the value of the consecutive folds
  fit  <- lm(y ~ poly(x, 2), data = d[-test_id[[i]], ])  # poly(x, 2) is x + x^2
  pred <- predict(fit, newdata = d[test_id[[i]], ])
  mse_quad[i] <- mean((d$y[test_id[[i]]] - pred)^2)  # test MSE of the i-th fold
}
mse_quad
```

```
##            [,1]
## [1,]   8.133892
## [2,]   3.153669
## [3,]  21.568271
## [4,] 629.114316
## [5,]  29.080438
```
```r
mse_cub <- matrix(nrow = folds)
for (i in 1:folds) {
  fit  <- lm(y ~ poly(x, 3), data = d[-test_id[[i]], ])
  pred <- predict(fit, newdata = d[test_id[[i]], ])
  mse_cub[i] <- mean((d$y[test_id[[i]]] - pred)^2)
}
mse_cub
```

```
##              [,1]
## [1,]     3.041523
## [2,]     6.436648
## [3,]    99.268457
## [4,] 11218.260959
## [5,]    13.700736
```
We can’t fit the quartic model
only 4 data points are left in the training set
at least 5 data points are required for the quartic model (it has 5 coefficients)
With larger data sets this would not be a problem
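This can be checked directly: `poly()` refuses a degree that is not smaller than the number of unique points. The four training points below are toy values, assumed for illustration:

```r
d4  <- data.frame(x = 1:4, y = c(2.1, 3.9, 5.2, 6.8))   # 4 training points (toy)
res <- try(lm(y ~ poly(x, 4), data = d4), silent = TRUE)
inherits(res, "try-error")   # TRUE: poly() needs more unique points than the degree
```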
```r
data.frame(linear    = mean(mse_lin),
           quadratic = mean(mse_quad),
           cubic     = mean(mse_cub),
           row.names = "Average test MSE")
```

```
##                    linear quadratic    cubic
## Average test MSE 7.565171  138.2101 2268.142
```
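The three nearly identical loops can be collapsed into a single function that takes the polynomial degree as an argument. A self-contained sketch (the data here are simulated for illustration, not the five lecture points):

```r
set.seed(1)
d <- data.frame(x = runif(20, 0, 5))
d$y <- 2 + d$x + rnorm(20, sd = 2)   # simulated data, for illustration
folds   <- 5
test_id <- split(sample(1:nrow(d)), rep(1:folds, length.out = nrow(d)))

cv_mse <- function(degree) {
  mse <- numeric(folds)
  for (i in 1:folds) {
    fit    <- lm(y ~ poly(x, degree), data = d[-test_id[[i]], ])
    pred   <- predict(fit, newdata = d[test_id[[i]], ])
    mse[i] <- mean((d$y[test_id[[i]]] - pred)^2)
  }
  mean(mse)                          # average test MSE over the folds
}

sapply(1:3, cv_mse)                  # degrees 1 (linear) to 3 (cubic)
```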
So with cross-validation we find the correct model: the linear model has the lowest average test MSE
Perform cross-validation on the `cars` data set