Statistical Programming with R

Content

  1. Generalized linear model (GLM)

    • variables with non-normal error distribution

    • families and link functions

  2. Dichotomous variables

    • binomial family

    • logit link function

    • logistic regression

  3. Fit measures

    • Deviance vs \(R^2\)

\(Y\) is continuous

The linear model applies to a continuous dependent variable \(Y\).


\[Y=\mu+\varepsilon, \ \ \ \ \ \mu=\beta_0+\beta{x}, \ \ \ \ \ \varepsilon\sim{N}(0,\sigma^2)\]


  • \(\mu\) is the mean of \(Y\) given the score on \(X\).

  • Residuals are normally distributed and homoscedastic.


\(Y\) is dichotomous

Dichotomous variables

- pass/fail the exam (pass = 1, fail = 0)

- smoker/non-smoker (smoker = 1, non-smoker = 0)


Predict the probability \(\mu=P(Y=1)\) with a linear model


\[Y=\mu+\varepsilon, \ \ \ \ \ \mu=\beta_0+\beta{x}, \ \ \ \ \ Y\sim{Bin}(1,\mu)\]


Problem:

  • binomial error distribution (non-normal and heteroscedastic)

  • estimates outside the interval \((0,1)\)
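
The second problem can be illustrated with a minimal sketch. The exam data are not shown here, so R's built-in mtcars data are used as a stand-in (an assumption), with the 0/1 variable am as the outcome and disp as the predictor.

```r
# Linear model fitted to a dichotomous (0/1) outcome.
# mtcars ships with R; am is 0 (automatic) or 1 (manual).
fit_lm <- lm(am ~ disp, data = mtcars)

# Fitted values are meant to be probabilities, so they should lie in (0, 1).
p_hat <- fitted(fit_lm)
range(p_hat)

# Cars whose predicted "probability" falls outside the permitted range
out_of_range <- p_hat[p_hat < 0 | p_hat > 1]
out_of_range
```

For cars with a large displacement the fitted value drops below 0, which cannot be read as a probability.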

Example: passing an exam

Distribution of variable pass for 100 students

  • 80 students passed the exam \((pass = 1)\)

  • 20 students failed the exam \((pass = 0)\)

Predictions from the linear model

Predict passing the exam from study time.

Residual plots
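
Residual plots of this kind can be produced with plot() on a fitted lm object. The sketch below again uses mtcars as an assumed stand-in for the exam data.

```r
# Residual diagnostics for a linear model with a 0/1 outcome
# (mtcars is an assumed stand-in for the exam data).
fit_lm <- lm(am ~ disp, data = mtcars)

# Residuals vs fitted values, and normal Q-Q plot of the residuals
par(mfrow = c(1, 2))
plot(fit_lm, which = 1)
plot(fit_lm, which = 2)

# Because am is 0 or 1, each residual equals either -fitted or 1 - fitted,
# so the points fall on two parallel lines rather than a random band.
res <- resid(fit_lm)
on_two_lines <- all(abs(res - (mtcars$am - fitted(fit_lm))) < 1e-10)
on_two_lines
```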

Generalized Linear Model

The GLM does not predict \(\mu\) but a function of \(\mu\).

  • The function \(g(\mu)\) is called the link function.


\[g(\mu)=\beta_0+\beta{x}\]


The link function ensures that predictions are within the permitted range.

  • between 0 and 1 for dichotomous variables
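
As a small sketch of how a link function keeps predictions in range, the logit link and its inverse can be written out by hand. The helper names logit and inv_logit are chosen here for illustration; base R offers the same functions as qlogis() and plogis().

```r
# Logit link and its inverse, written out by hand
logit     <- function(mu)  log(mu / (1 - mu))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

# Any value of the linear predictor maps to a probability in (0, 1)
eta <- c(-10, -2, 0, 2, 10)
mu  <- inv_logit(eta)
mu

# Base R provides the same functions as qlogis() and plogis()
all.equal(logit(mu), qlogis(mu))
all.equal(mu, plogis(eta))
```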


The GLM does not assume normality or homoscedasticity.

Families of distributions

The GLM distinguishes various families of distributions, e.g.:

  • gaussian family for continuous variables

  • binomial family for dichotomous variables


family   | DV          | link     | \(g(\mu)=\beta_0+\beta{x}\)        | \(\mu=g^{-1}(\beta_0+\beta{x})\)
gaussian | continuous  | identity | \(g(\mu)=\mu\)                     | \(\mu=\beta_0+\beta{x}\)
binomial | dichotomous | logit    | \(g(\mu)=\log\frac{\mu}{1-\mu}\)   | \(\mu=\frac{\exp(\beta_0+\beta{x})}{1+\exp(\beta_0+\beta{x})}\)
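
In R, each family is an object that carries its link and inverse link as functions. The following sketch inspects the two families from the table; the variable names are illustrative.

```r
# Family objects carry the link and inverse link as functions
fam_gauss <- gaussian()   # default link: identity
fam_binom <- binomial()   # default link: logit

fam_gauss$link
fam_binom$link

# Link and inverse link for the binomial family
fam_binom$linkfun(0.8)       # log(0.8 / 0.2)
fam_binom$linkinv(log(4))    # back to the probability scale
```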

Fitting a GLM

Like lm(), but with an additional family argument:

glm(formula, family = c("gaussian", "binomial"), data)
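
A concrete call might look as follows. This is a sketch on the built-in mtcars data (an assumed example data set): the gaussian family reproduces the lm() estimates, and the binomial family fits a logistic regression.

```r
# Gaussian family with identity link reproduces lm()
fit_lm    <- lm(mpg ~ disp, data = mtcars)
fit_gauss <- glm(mpg ~ disp, family = gaussian, data = mtcars)
all.equal(coef(fit_lm), coef(fit_gauss))

# Binomial family (logit link by default) for a 0/1 outcome
fit_logit <- glm(am ~ disp, family = binomial, data = mtcars)
coef(fit_logit)
```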


Predictions

predict(object, type = c("link", "response"))


  • object is a fitted GLM model

  • type = "link" for prediction of the linear predictor \(g(\mu)=\beta_0+\beta{x}\)

  • type = "response" for prediction of the mean \(\mu=g^{-1}(\beta_0+\beta{x})\)
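
The two prediction types can be compared directly. In this sketch the model is fitted on mtcars (an assumed example) and the new displacement values are hypothetical.

```r
fit_logit <- glm(am ~ disp, family = binomial, data = mtcars)

# New observations to predict for (hypothetical displacement values)
new_cars <- data.frame(disp = c(100, 200, 300))

eta <- predict(fit_logit, newdata = new_cars, type = "link")      # logit scale
mu  <- predict(fit_logit, newdata = new_cars, type = "response")  # probability scale

# The response-scale prediction is the inverse logit of the link-scale one
all.equal(plogis(eta), mu)
```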

Probability of passing the exams

Logistic regression model

Call:
glm(formula = pass ~ study, family = binomial)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.28293   0.00184   0.03104   0.15970   1.64812  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -25.633      7.143  -3.589 0.000332 ***
study         10.851      2.948   3.680 0.000233 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 100.080  on 99  degrees of freedom
Residual deviance:  25.818  on 98  degrees of freedom
AIC: 29.818

Number of Fisher Scoring iterations: 8

Interpretation of parameter estimates

Parameter estimates are on the logit scale

\[logit(pass)=\log\frac{\mu}{1-\mu}=\beta_0+\beta{x}\]


Probability estimates are obtained via the inverse link:

\[P(pass)=\frac{\exp(\beta_0+\beta{x})}{1+\exp(\beta_0+\beta{x})}\]


Notice that \(P(pass)\) is always between 0 and 1!

Example

Logit for a student who studied 2.5 hrs

\[logit(pass)=\beta_0+2.5\beta_{study}=-25.63 + 2.5\times10.85\approx 1.5\]

Probability to pass:

\[P(pass)=\frac{\exp(\beta_0+2.5\beta_{study})}{1+\exp(\beta_0+2.5\beta_{study})}=\frac{\exp(1.5)}{1+\exp(1.5)}\approx 0.82\]
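
The same arithmetic can be checked in R. This sketch plugs the estimates from the logistic regression summary (intercept -25.633, slope 10.851) into the inverse link; the variable names are illustrative.

```r
# Estimates copied from the logistic regression summary
b0      <- -25.633
b_study <- 10.851

# Logit and probability of passing after 2.5 hours of study
eta <- b0 + 2.5 * b_study
p   <- plogis(eta)          # exp(eta) / (1 + exp(eta))
c(logit = eta, probability = p)

# Study time at which the logit is 0, i.e. P(pass) = 0.5
study_half <- -b0 / b_study
study_half
```

With these estimates the logit at 2.5 hours is about 1.5, and the probability of passing equals one half at roughly 2.36 hours of study.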

Regression line

Fit measures lm()

The linear model uses the \(F\) and \(R^2\) statistics

Call:
lm(formula = am ~ disp, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6696 -0.2989  0.0443  0.2786  0.8800 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.9554478  0.1547207   6.175 8.55e-07 ***
disp        -0.0023803  0.0005928  -4.015 0.000366 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4091 on 30 degrees of freedom
Multiple R-squared:  0.3495,    Adjusted R-squared:  0.3279 
F-statistic: 16.12 on 1 and 30 DF,  p-value: 0.0003662
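
The fit measures can also be extracted from the summary object instead of being read off the printout. A minimal sketch for the model shown above:

```r
fit_lm <- lm(am ~ disp, data = mtcars)
s <- summary(fit_lm)

s$r.squared        # proportion of variance explained
s$adj.r.squared    # adjusted for the number of predictors
s$fstatistic       # F value with numerator and denominator df
```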

Fit measures glm()

The GLM uses the deviance and AIC instead of \(F\) and \(R^2\)

Call:
glm(formula = am ~ disp, family = binomial, data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5651  -0.6648  -0.2460   0.7276   2.2691  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept)  2.630849   1.050170   2.505  0.01224 * 
disp        -0.014604   0.005168  -2.826  0.00471 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 29.732  on 30  degrees of freedom
AIC: 33.732

Number of Fisher Scoring iterations: 5

Deviance and AIC

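Deviance is a measure of the difference between observed and fitted values

  • Null deviance: deviance of the intercept-only model

  • Residual deviance: deviance of the fitted model


The larger the drop from null deviance to residual deviance, the more the predictors improve on the intercept-only model


  • AIC: fit penalized for the number of parameters; among candidate models, the one with the lowest AIC is preferred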
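
These quantities can be pulled out of a fitted model. The sketch below does so for the am ~ disp logistic regression on mtcars; the object names are illustrative.

```r
fit_null  <- glm(am ~ 1,    family = binomial, data = mtcars)
fit_logit <- glm(am ~ disp, family = binomial, data = mtcars)

# Null deviance of the fitted model equals the deviance of the intercept-only model
fit_logit$null.deviance
deviance(fit_null)

# Reduction in deviance achieved by adding disp
dev_drop <- fit_logit$null.deviance - deviance(fit_logit)
dev_drop

# Lower AIC is better
AIC(fit_null, fit_logit)
```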

Example

Intercept-only model

Call:  glm(formula = pass ~ 1, family = binomial)

Coefficients:
(Intercept)  
      1.386  

Degrees of Freedom: 99 Total (i.e. Null);  99 Residual
Null Deviance:      100.1 
Residual Deviance: 100.1    AIC: 102.1
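
The intercept of this model is simply the logit of the observed pass rate. A sketch that rebuilds the outcome from the counts given earlier (80 passed, 20 failed); the study variable is not needed for the intercept-only model:

```r
# 80 passes and 20 fails, as in the example
pass <- rep(c(1, 0), times = c(80, 20))

fit0 <- glm(pass ~ 1, family = binomial)

# The intercept is the logit of the observed pass rate
coef(fit0)
qlogis(mean(pass))      # log(0.8 / 0.2) = log(4)

# The fitted probability is the same for every student
plogis(coef(fit0))
```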

Model with predictor study

Call:  glm(formula = pass ~ study, family = binomial)

Coefficients:
(Intercept)        study  
     -25.63        10.85  

Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
Null Deviance:      100.1 
Residual Deviance: 25.82    AIC: 29.82

Summary

Generalized Linear Model

  • variables with non-normal error distribution

  • family and link function

  • inverse link function (type = "response") gives predictions on the original scale of the DV

  • fit is measured by the deviance and AIC