Generalized linear models

Statistical Programming with R

Content

Generalized linear model (GLM)
- variables with non-normal error distribution
- families and link functions
Dichotomous variables
- binomial family
- logit link function
- logistic regression
Fit measures
- Deviance vs \(R^2\)

\(Y\) is continuous

The linear model applies to a continuous dependent variable \(Y\).

\[Y=\mu+\varepsilon, \ \ \ \ \ \mu=\beta_0+\beta{x}, \ \ \ \ \ \varepsilon\sim{N}(0,\sigma^2)\]

\(\mu\) is the mean of \(Y\) given the score on \(X\).
Residuals are assumed to be normally distributed and homoscedastic.

\(Y\) is dichotomous

Dichotomous variables

- pass/fail the exam (pass = 1, fail = 0)

- smoker/non-smoker (smoker = 1, non-smoker = 0)

Predict the probability \(\mu=P(Y=1)\) with a linear model

\[Y=\mu+\varepsilon, \ \ \ \ \ \mu=\beta_0+\beta{x}, \ \ \ \ \ Y\sim{Bin}(1,\mu)\]

Problems:

- binomial error distribution (non-normal and heteroscedastic)
- estimates outside the interval \((0,1)\)
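
The second problem can be illustrated with a small simulation. This is only a sketch: the data are simulated, and the variable names `study` and `pass` as well as the coefficients used to generate them are assumptions made for illustration, not the course data.

```r
# Simulated (hypothetical) data: study time in hours and a 0/1 pass indicator
set.seed(123)
study <- runif(100, min = 1, max = 4)
pass  <- rbinom(100, size = 1, prob = plogis(-10 + 4 * study))

# Ordinary linear model fitted to the 0/1 outcome
fit_lm <- lm(pass ~ study)

# The fitted "probabilities" are not restricted to the interval (0, 1)
range(fitted(fit_lm))
```

With a steep relation between study time and passing, the straight line has to overshoot at both ends of the predictor range.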

Example: passing an exam

Distribution of variable pass for 100 students

80 students passed the exam \((pass = 1)\)
20 students failed the exam \((pass = 0)\)

Predictions from the linear model

Predict passing the exam from study time.

Residual plots

Generalized Linear Model

The GLM does not predict \(\mu\) but a function of \(\mu\).

The function \(g(\mu)\) is called the link function.

\[g(\mu)=\beta_0+\beta{x}\]

The link function ensures that predictions of \(\mu\) are within the permitted range:

- between 0 and 1 for dichotomous variables

The GLM does not assume normality or homoscedasticity.
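
As a rough illustration of how a link function keeps predictions in range, the logit link and its inverse can be written out by hand. The helper names `logit` and `inv_logit` are hypothetical; base R offers the same functions as `qlogis()` and `plogis()`.

```r
# Logit link and its inverse, written out by hand (hypothetical helper names)
logit     <- function(mu)  log(mu / (1 - mu))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

eta <- c(-10, -1, 0, 1, 10)   # arbitrary values of the linear predictor
mu  <- inv_logit(eta)
mu                            # every value lies strictly between 0 and 1

# Base R equivalents: plogis() is the inverse logit, qlogis() the logit
all.equal(mu, plogis(eta))
all.equal(logit(mu), eta)
```

Whatever value the linear predictor takes, the inverse link maps it back into the interval (0, 1).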

Families of distributions

The GLM distinguishes various families of distributions, e.g.:

- gaussian family for continuous variables
- binomial family for dichotomous variables

family	dependent variable	link	\(g(\mu)=\beta_0+\beta{x}\)	\(\mu=g^{-1}(\beta_0+\beta{x})\)
gaussian	continuous	identity	\(g(\mu)=\mu\)	\(\mu=\beta_0+\beta{x}\)
binomial	dichotomous	logit	\(g(\mu)=\log\frac{\mu}{1-\mu}\)	\(\mu=\frac{\exp(\beta_0+\beta{x})}{1+\exp(\beta_0+\beta{x})}\)
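
The default link of each family in the table can be inspected on R's family objects. A short sketch; the object names `fam_gauss` and `fam_binom` are chosen here for illustration.

```r
# Family objects carry the name of their default link and the link functions
fam_gauss <- gaussian()
fam_binom <- binomial()

fam_gauss$link          # "identity"
fam_binom$link          # "logit"

# linkfun is g(mu), linkinv is the inverse g^{-1}(eta)
fam_binom$linkfun(0.5)  # logit of 0.5 is 0
fam_binom$linkinv(0)    # inverse logit of 0 is 0.5
```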

Fitting a GLM

Same as lm(), but with an additional family argument:

glm(formula, family = c("gaussian", "binomial"), data)
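
A minimal sketch of both calls, using the built-in mtcars data with am as the dichotomous outcome and disp as the predictor. The object names are chosen here for illustration.

```r
# Logistic regression: binomial family (logit link by default)
fit_binom <- glm(am ~ disp, family = binomial, data = mtcars)
coef(fit_binom)

# With the gaussian family, glm() gives the same coefficients as lm()
fit_gauss <- glm(am ~ disp, family = gaussian, data = mtcars)
fit_lm    <- lm(am ~ disp, data = mtcars)
all.equal(coef(fit_gauss), coef(fit_lm))
```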

Predictions

predict(object, type = c("link", "response"))

- object is a fitted GLM
- type = "link" for prediction of the linear predictor \(g(\mu)=\beta_0+\beta{x}\)
- type = "response" for prediction of the mean \(\mu=g^{-1}(\beta_0+\beta{x})\)
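
The two prediction types can be compared directly. The sketch below again uses mtcars; the data frame `new_cars` and its disp values are hypothetical.

```r
fit <- glm(am ~ disp, family = binomial, data = mtcars)

# Hypothetical new observations
new_cars <- data.frame(disp = c(100, 200, 300))

eta <- predict(fit, newdata = new_cars, type = "link")      # logit scale
mu  <- predict(fit, newdata = new_cars, type = "response")  # probability scale

# Applying the inverse logit to the link predictions gives the response predictions
all.equal(plogis(eta), mu)
```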

Probability of passing the exam

Logistic regression model

Call:
glm(formula = pass ~ study, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.28293   0.00184   0.03104   0.15970   1.64812  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -25.633      7.143  -3.589 0.000332 ***
study         10.851      2.948   3.680 0.000233 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 100.080  on 99  degrees of freedom
Residual deviance:  25.818  on 98  degrees of freedom
AIC: 29.818

Number of Fisher Scoring iterations: 8

Interpretation of parameter estimates

Parameter estimates are on the logit scale

\[logit(pass)=\log\frac{\mu}{1-\mu}=\beta_0+\beta{x}\]

Probability estimates are obtained via the inverse link:

\[P(pass)=\frac{\exp(\beta_0+\beta{x})}{1+\exp(\beta_0+\beta{x})}\]

Notice that \(P(pass)\) is always between 0 and 1!

Example

Logit for a student who studied 2.5 hrs

\[logit(pass)=\beta_0+2.5\beta_{study}=-25.633 + 2.5\times10.851\approx 1.49\]

Probability to pass:

\[P(pass)=\frac{\exp(\beta_0+2.5\beta_{study})}{1+\exp(\beta_0+2.5\beta_{study})}=\frac{\exp(1.49)}{1+\exp(1.49)}\approx 0.82\]
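
The study time at which the predicted probability equals exactly one half is the point where the logit is zero, that is \(x=-\beta_0/\beta_{study}\). A small sketch with the estimates copied from the summary output; the names `b0`, `b1`, `p_pass` and `x_half` are hypothetical.

```r
# Estimates copied from the logistic regression summary
b0 <- -25.633   # intercept
b1 <- 10.851    # slope for study

# Predicted probability of passing for a given study time (inverse logit)
p_pass <- function(hours) plogis(b0 + b1 * hours)

# Study time at which the logit is zero
x_half <- -b0 / b1
x_half          # about 2.36
p_pass(x_half)  # 0.5
```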

Regression line

Fit measures `lm()`

The linear model uses the \(F\) and \(R^2\) statistics

Call:
lm(formula = am ~ disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6696 -0.2989  0.0443  0.2786  0.8800 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.9554478  0.1547207   6.175 8.55e-07 ***
disp        -0.0023803  0.0005928  -4.015 0.000366 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4091 on 30 degrees of freedom
Multiple R-squared:  0.3495,    Adjusted R-squared:  0.3279 
F-statistic: 16.12 on 1 and 30 DF,  p-value: 0.0003662
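
The fit statistics printed above can also be extracted from the summary object, which is convenient when comparing models in code. A brief sketch:

```r
fit_lm <- lm(am ~ disp, data = mtcars)
s <- summary(fit_lm)

s$r.squared      # multiple R-squared, 0.3495 in the output above
s$adj.r.squared  # adjusted R-squared
s$fstatistic     # F value with numerator and denominator df
```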

Fit measures `glm()`

The generalized linear model uses the deviance and AIC

Call:
glm(formula = am ~ disp, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5651  -0.6648  -0.2460   0.7276   2.2691  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  2.630849   1.050170   2.505  0.01224 * 
disp        -0.014604   0.005168  -2.826  0.00471 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 29.732  on 30  degrees of freedom
AIC: 33.732

Number of Fisher Scoring iterations: 5

Deviance and AIC

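Deviance is a measure of the difference between observed and fitted values

- Null deviance: deviance of the intercept-only model
- Residual deviance: deviance of the fitted model

The larger the drop from null deviance to residual deviance, the better the model fits

AIC penalizes model fit for the number of parameters: the model with the lowest AIC offers the best trade-off between fit and parsimony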
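
These quantities can be read off a fitted model. The sketch below uses the mtcars model from the previous slide. The chi-square comparison in the last line is a common follow-up that is not shown on the slides and is included only as an illustration.

```r
fit <- glm(am ~ disp, family = binomial, data = mtcars)

fit$null.deviance   # deviance of the intercept-only model (43.230)
deviance(fit)       # residual deviance of the fitted model (29.732)
AIC(fit)            # 33.732

# Drop in deviance and the degrees of freedom it uses
dev_drop <- fit$null.deviance - deviance(fit)
df_drop  <- fit$df.null - fit$df.residual

# Assumption: refer the drop to a chi-square distribution (likelihood-ratio test)
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
```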

Example

Intercept-only model

Call:  glm(formula = pass ~ 1, family = binomial)

Coefficients:
(Intercept)  
      1.386  

Degrees of Freedom: 99 Total (i.e. Null);  99 Residual
Null Deviance:      100.1 
Residual Deviance: 100.1    AIC: 102.1
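
The intercept and the null deviance follow directly from the 80 passes and 20 fails reported earlier. A sketch of the calculation; the variable names are hypothetical.

```r
n_pass <- 80
n_fail <- 20
p_hat  <- n_pass / (n_pass + n_fail)   # observed proportion passing, 0.8

# Intercept of the intercept-only model: the logit of the observed proportion
qlogis(p_hat)                          # log(0.8 / 0.2) = log(4), about 1.386

# Null deviance: -2 times the binomial log-likelihood at p_hat
null_dev <- -2 * (n_pass * log(p_hat) + n_fail * log(1 - p_hat))
null_dev                               # about 100.08
```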

Model with predictor study

Call:  glm(formula = pass ~ study, family = binomial)

Coefficients:
(Intercept)        study  
     -25.63        10.85  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      100.1 
Residual Deviance: 25.82    AIC: 29.82
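
As a check on the two printed models, the comparison can be worked out by hand. The relation AIC = deviance + 2k (with k the number of estimated coefficients) is an assumption that holds for ungrouped 0/1 data; it is not stated on the slides.

\[D_{null}-D_{residual}=100.080-25.818=74.262 \ \ \ \text{on} \ \ \ 99-98=1 \ \text{degree of freedom}\]

\[AIC_{intercept}=100.080+2\times1=102.080, \ \ \ \ \ AIC_{study}=25.818+2\times2=29.818\]

Both the large drop in deviance and the lower AIC favor the model with study.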

Summary

Generalized Linear Model

- variables with non-normal error distribution
- family and link function
- inverse link function (`type = "response"`) gives predictions on the original scale of the DV
- fit is measured by the deviance and AIC
