Generalized linear model (GLM)
variables with non-normal error distribution
families and link functions
Dichotomous variables
binomial family
logit link function
logistic regression
Fit measures
- Deviance vs \(R^2\)
Statistical Programming with R
The linear model applies to a continuous dependent variable \(Y\).
\[Y=\mu+\varepsilon, \ \ \ \ \ \mu=\beta_0+\beta{x}, \ \ \ \ \ \varepsilon\sim{N}(0,\sigma^2)\]
\(\mu\) is the mean of \(Y\) given the score on \(X\).
Residuals are normally distributed and homoscedastic.
Dichotomous variables
- pass/fail the exam (pass = 1, fail = 0)
- smoker/non-smoker (smoker = 1, non-smoker = 0)
Predict the probability \(\mu=P(Y=1)\) with a linear model
\[\mu=\beta_0+\beta{x}+\varepsilon, \ \ \ \ \ \varepsilon\sim{Bin}(n,p)\]
Problem:
binomial error distribution (non-normal and heteroscedastic)
estimates outside the interval \((0,1)\)
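A minimal sketch of the second problem, assuming the built-in `mtcars` data as a stand-in: its 0/1 variable `am` is regressed on `disp` with an ordinary linear model, and the fitted values are checked against the interval \((0,1)\). The object names `fit_lm` and `p_hat` are illustrative only.

```r
# Linear probability model: ordinary least squares on a 0/1 outcome.
# mtcars ships with base R; am is 0 (automatic) or 1 (manual).
fit_lm <- lm(am ~ disp, data = mtcars)

# Fitted "probabilities" for every car in the data
p_hat <- fitted(fit_lm)

# Range of the fitted values; the lower end falls below 0
range(p_hat)

# Number of cars whose fitted value lies outside the interval (0, 1)
sum(p_hat < 0 | p_hat > 1)
```

For cars with a large displacement the straight line drops below zero, which cannot be read as a probability.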
Distribution of variable pass
for 100 students
80 students passed the exam \((pass = 1)\)
20 students failed the exam \((pass = 0)\)
Predict passing the exam from study time.
The GLM does not predict \(\mu\) but a function of \(\mu\).
\[g(\mu)=\beta_0+\beta{x}\]
The link function ensures that predictions are within the permitted range.
The GLM does not assume normality or homoscedasticity.
The GLM distinguishes various families of distributions, e.g.:
gaussian family for continuous variables
binomial family for dichotomous variables
family | DV | link | \(g(\mu)=\beta_0+\beta{x}\) | \(\mu=g^{-1}(\beta_0+\beta{x})\) |
---|---|---|---|---|
gaussian | continuous | identity | \(g(\mu)=\mu\) | \(\mu=\beta_0+\beta{x}\) |
binomial | dichotomous | logit | \(g(\mu)=\log\frac{\mu}{1-\mu}\) | \(\mu=\frac{\exp(\beta_0+\beta{x})}{1+\exp(\beta_0+\beta{x})}\) |
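The two columns of the table can be checked numerically. The sketch below writes the logit link and its inverse by hand (the helper names `logit` and `inv_logit` are illustrative, not part of R) and compares them with the base R functions `qlogis()` and `plogis()` and with the link functions stored in the family objects.

```r
# Logit link and its inverse, written out as in the table
logit     <- function(mu)  log(mu / (1 - mu))
inv_logit <- function(eta) exp(eta) / (1 + exp(eta))

mu  <- c(0.2, 0.5, 0.8)   # probabilities
eta <- logit(mu)          # the same values on the logit scale

# Base R equivalents: qlogis() is the logit, plogis() its inverse
all.equal(eta, qlogis(mu))
all.equal(inv_logit(eta), plogis(eta))

# The family objects carry the same functions
all.equal(binomial()$linkfun(mu), eta)
all.equal(gaussian()$linkinv(eta), eta)   # identity link leaves values unchanged

# Any real-valued linear predictor maps back into (0, 1)
inv_logit(c(-10, 0, 10))
```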
As `lm()`, but with an additional `family` argument:
glm(formula, family = c("gaussian", "binomial"), data)
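As a concrete sketch of that call, assuming the built-in `mtcars` data (the object names are illustrative): with `family = gaussian` the GLM reproduces the `lm()` coefficients, and with `family = binomial` the same formula becomes a logistic regression.

```r
# Gaussian family (identity link) gives the same estimates as lm()
fit_lm    <- lm(am ~ disp, data = mtcars)
fit_gauss <- glm(am ~ disp, family = gaussian, data = mtcars)
all.equal(coef(fit_lm), coef(fit_gauss))

# Binomial family (logit link by default) gives logistic regression
fit_logit <- glm(am ~ disp, family = binomial, data = mtcars)
coef(fit_logit)   # estimates on the logit scale
```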
Predictions
predict(object, type = c("link", "response"))
`object` is a fitted GLM model
`type = "link"` for prediction of the linear predictor \(g(\mu)=\beta_0+\beta{x}\)
`type = "response"` for prediction of the mean \(\mu=g^{-1}(\beta_0+\beta{x})\)
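A short sketch of the two prediction types, again assuming the built-in `mtcars` data. The data frame `new_cars` and its displacement values are hypothetical and serve only as an illustration.

```r
fit <- glm(am ~ disp, family = binomial, data = mtcars)

# Hypothetical new observations
new_cars <- data.frame(disp = c(100, 200, 300))

# Linear predictor (logit scale): can be any real number
eta <- predict(fit, newdata = new_cars, type = "link")

# Mean (probability scale): always between 0 and 1
mu <- predict(fit, newdata = new_cars, type = "response")

# The response is the inverse link applied to the linear predictor
all.equal(plogis(eta), mu)
```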
Logistic regression model
    Call:
    glm(formula = pass ~ study, family = binomial)

    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -2.28293   0.00184   0.03104   0.15970   1.64812

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -25.633      7.143  -3.589 0.000332 ***
    study         10.851      2.948   3.680 0.000233 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 100.080  on 99  degrees of freedom
    Residual deviance:  25.818  on 98  degrees of freedom
    AIC: 29.818

    Number of Fisher Scoring iterations: 8
Parameter estimates are on the logit scale
\[logit(pass)=\log\frac{\mu}{1-\mu}=\beta_0+\beta{x}\]
Probability estimates are obtained via the inverse link:
\[P(pass)=\frac{\exp(\beta_0+\beta{x})}{1+\exp(\beta_0+\beta{x})}\]
Notice that \(P(pass)\) is always between 0 and 1!
Logit for a student who studied 2.5 hrs
\[logit(pass)=\beta_0+2.5\beta_{study}=-25.633 + 2.5\times10.851\approx 1.49\]
Probability to pass:
\[P(pass)=\frac{\exp(\beta_0+2.5\beta_{study})}{1+\exp(\beta_0+2.5\beta_{study})}=\frac{\exp(1.49)}{1+\exp(1.49)}\approx 0.82\]
The logit equals 0, and hence \(P(pass)=\frac{1}{2}\), at \(study=25.633/10.851\approx 2.36\) hrs.
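The arithmetic can be reproduced directly from the printed estimates. This is a sketch that types in the rounded coefficients from the summary output, since the exam data themselves are not available here; the variable names are illustrative.

```r
# Estimates copied from the printed summary (rounded to three decimals)
b0      <- -25.633
b_study <-  10.851

# Logit for 2.5 hours of study
eta <- b0 + 2.5 * b_study
eta

# Probability to pass via the inverse logit (roughly 0.82)
p <- plogis(eta)
p

# Study time at which the logit is 0, i.e. P(pass) = 0.5 (roughly 2.36 hrs)
half_point <- -b0 / b_study
half_point
```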
lm()
The linear model uses the \(F\) and \(R^2\) statistics
    Call:
    lm(formula = am ~ disp, data = mtcars)

    Residuals:
        Min      1Q  Median      3Q     Max
    -0.6696 -0.2989  0.0443  0.2786  0.8800

    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
    (Intercept)  0.9554478  0.1547207   6.175 8.55e-07 ***
    disp        -0.0023803  0.0005928  -4.015 0.000366 ***
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    Residual standard error: 0.4091 on 30 degrees of freedom
    Multiple R-squared:  0.3495, Adjusted R-squared:  0.3279
    F-statistic: 16.12 on 1 and 30 DF,  p-value: 0.0003662
glm()
The GLM uses the deviance and the AIC
    Call:
    glm(formula = am ~ disp, family = binomial, data = mtcars)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.5651  -0.6648  -0.2460   0.7276   2.2691

    Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
    (Intercept)  2.630849   1.050170   2.505  0.01224 *
    disp        -0.014604   0.005168  -2.826  0.00471 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 43.230  on 31  degrees of freedom
    Residual deviance: 29.732  on 30  degrees of freedom
    AIC: 33.732

    Number of Fisher Scoring iterations: 5
Deviance is a measure of the difference between observed and fitted values
Null deviance: deviance of the intercept-only model
Residual deviance: deviance of the fitted model
The larger the drop from null deviance to residual deviance, the more the predictors improve the fit
AIC: the model with the lowest AIC offers the best trade-off between fit and parsimony
Intercept-only model
    Call:  glm(formula = pass ~ 1, family = binomial)

    Coefficients:
    (Intercept)
          1.386

    Degrees of Freedom: 99 Total (i.e. Null);  99 Residual
    Null Deviance:      100.1
    Residual Deviance: 100.1    AIC: 102.1
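The intercept-only output depends only on the split of 80 passes and 20 fails, so it can be rebuilt without the original data. The sketch below constructs such a vector (an assumption that matches the stated counts, not the original data file) and refits the model.

```r
# 80 students passed (1), 20 failed (0)
pass <- rep(c(1, 0), times = c(80, 20))

fit0 <- glm(pass ~ 1, family = binomial)

# The intercept is the logit of the observed pass rate: log(0.8 / 0.2)
coef(fit0)
log(0.8 / 0.2)

# Back on the probability scale the intercept is the pass rate itself
plogis(coef(fit0))

# Null and residual deviance coincide for the intercept-only model
deviance(fit0)
fit0$null.deviance
AIC(fit0)
```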
Model with predictor study
    Call:  glm(formula = pass ~ study, family = binomial)

    Coefficients:
    (Intercept)        study
         -25.63        10.85

    Degrees of Freedom: 99 Total (i.e. Null);  98 Residual
    Null Deviance:      100.1
    Residual Deviance: 25.82    AIC: 29.82
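The same comparison of an intercept-only model with a one-predictor model can be sketched on the built-in `mtcars` data, since the exam data are not available here. The object names are illustrative. For 0/1 outcomes the AIC equals the residual deviance plus twice the number of estimated coefficients, which the last lines check.

```r
fit0 <- glm(am ~ 1,    family = binomial, data = mtcars)  # intercept only
fit1 <- glm(am ~ disp, family = binomial, data = mtcars)  # with predictor

# The null deviance of the larger model is the deviance of the intercept-only model
all.equal(fit1$null.deviance, deviance(fit0))

# Drop in deviance gained by adding disp
dev_drop <- deviance(fit0) - deviance(fit1)
dev_drop

# AIC of both models; the lower value is preferred
c(intercept_only = AIC(fit0), with_disp = AIC(fit1))

# For binary data: AIC = residual deviance + 2 * number of coefficients
all.equal(AIC(fit1), deviance(fit1) + 2 * length(coef(fit1)))
```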
Generalized Linear Model
variables with non-normal error distribution
family and link function
inverse link function (`type = "response"`) gives predictions on the original scale of the DV
fit is measured by the deviance and AIC