This article is part of the R for Researchers series. For a list of topics covered by this series, see the Introduction article. If you're new to R we highly recommend reading the articles in order.

This article will introduce you to specifying the the link and variance function for a generalized linear model (GLM.) The article provides example models for binary, poisson, quasipoisson, and negative binomial models. The article also provides a diagnostic method to examine the variance assumption of a GLM model.

You will get the most from this article if you follow along with the examples in RStudio. Working the exercise will further enhance your skills with the material. The following steps will prepare your RStudio session to run this article's examples.

- Start RStudio and open your RFR project.
- Confirm that RFR (the name of your project) is displayed in the upper left corner of the RStudio window.
- Open new R script.
- Save the script as glm.R.
Enter the following command in your script and run it.

`####################################################### ####################################################### ## ## Environment Setup ## ####################################################### ####################################################### library(faraway) # glm support library(MASS) # negative binomial support library(car) # regression functions library(lme4) # random effects library(ggplot2) # plotting commands library(reshape2) # wide to tall reshaping library(xtable) # nice table formatting library(knitr) # kable table formatting library(grid) # units function for ggplot saveDir <- getwd() # get the current working directory wd <- "u:/RFR" # path to my project setwd(wd) # set this path as my work directory`

There are no console results from these commands.

Generalized linear models (GLM) are useful when the range of your response variable is constrained and/or the variance is not constant or normally distributed. GLM models transform the response variable to allow the fit to be done by least squares. The transformation done on the response variable is defined by the **link** function. This transformation of the response may constrain the range of the response variable. The **variance** function specifies the relationship of the variance to the mean. In R a **family** specifies the variance and link functions which are used in the model fit. As an example the family poisson uses the "log" link function and "\(\mu\)" as the variance function. A GLM model is defined by both the formula and the family.

GLM models can also be used to fit data in which the variance is proportional to one of the defined variance functions. This is done with quasi families. Pearson's \(\chi^2\) is used to scale the variance in the quasi families. An example would be data in which the variance is proportional to the mean. This would use the "quasipoisson" family. This results in a variance function of \(\alpha \mu\) instead of \(1 \mu\) as for Poission distributed data. The quasi families allows inference to be done when your data is over or underdispersed, provided that the variance is proportion.

The default link function for a family can be changed by specifying a link to the family function. For example, if the response variable is non negative and the variance is proportional to the mean, you would use the "identity" link with the "quasipoisson" family function. This would be specified as

`family=quasipoisson(link="identity")`

The variance function is specified by the family.

The decision of which family is appropriate is not discussed in this series.

GLM models have a defined relationship between the expected variance and the mean. This relationship can be used to evaluate the model's goodness of fit to the data. The deviance can be used for this goodness of fit check. Under asymptotic conditions the deviance is expected to be \(\chi^2_{df}\) distributed. Pearson's \(\chi^2\) can also be used for this measure of goodness of fit, though it is the deviance which is minimized when fitting a GLM model.

There are some limits to the goodness of fit evaluation.

- When the response data is binary, the deviance approximations are not even approximately correct.
- The deviance approximations are also not useful when there are small group sizes.
- The goodness of fit tests using deviance or Pearson's are not applicable with a quasi family model.

Residual plots are useful for some GLM models and much less useful for others. When residuals are useful in the evaluation a GLM model, the plot of Pearson's residuals versus the fitted link values is typically the most helpful. The Pearson's residuals are normalized by the variance and are expected to then be constant across the prediction range. Pearson's residuals and the fitted link values are obtained by extractor functions.

Syntax and use of the

**type**parameter in residuals() and predict().residuals(modObj, type=

*resType*)Returns a vector of the residuals. The scale of the residuals is determined by the type parameter

*ResType*can be set to "deviance", "pearson", "working", "response", or "partial".predict(modObj, type="fitType")

Returns a vector of fitted values. The scale of the residuals is determined by the type parameter.

*fitType*can be set to "link", "response" or "terms".

Variable selection for a GLM model is similar to the process for an OLS model. Nested model tests for significance of a coefficient are preferred to Wald test of coefficients. This is due to GLM coefficients standard errors being sensitive to even small deviations from the model assumptions. It is also more accurate to take p-values for the GLM coefficients from nested model tests.

The likelihood ratio test (LRT) is typically used to test nested models. For quasi family models an F test is used for nested model tests (or when the fit is overdispersed or underdispersed.) This use of the F statistic is appropriate if the group sizes are approximately equal.

Which variable to select for a model may depend on the family that is being used in the model. In these cases variable selection is connected with family selection. Variable selection criteria such as AIC and BIC are generally not applicable for selecting between families.

Binary data, like binomial data, is typically modelled with the logit link and variance function \(\mu(1-\mu)\). The modelled response is the predicted log odds of an event.

We will use the hsb dataset from the faraway package for our binary response model. This dataset is a subset of a National Education Longitudinal Studies dataset. We will start by loading the data.frame and taking a look at the variables.

Enter the following commands in your script and run them.

`data(hsb) str(hsb)`

The results of the above commands are shown below.

`'data.frame': 200 obs. of 11 variables: $ id : int 70 121 86 141 172 113 50 11 84 48 ... $ gender : Factor w/ 2 levels "female","male": 2 1 2 2 2 2 2 2 2 2 ... $ race : Factor w/ 4 levels "african-amer",..: 4 4 4 4 4 4 1 3 4 1 ... $ ses : Factor w/ 3 levels "high","low","middle": 2 3 1 1 3 3 3 3 3 3 ... $ schtyp : Factor w/ 2 levels "private","public": 2 2 2 2 2 2 2 2 2 2 ... $ prog : Factor w/ 3 levels "academic","general",..: 2 3 2 3 1 1 2 1 2 1 ... $ read : int 57 68 44 63 47 44 50 34 63 57 ... $ write : int 52 59 33 44 52 52 59 46 57 55 ... $ math : int 41 53 54 47 57 51 42 45 54 52 ... $ science: int 47 63 58 53 53 63 53 39 58 50 ... $ socst : int 57 61 31 56 61 61 61 36 51 51 ...`

We will model the odds of a student's program of choice being "academic" as our response variable. We will use all the other variables in the dataset as independent variables. We will ignore interactions here to focus on the GLM fitting process.

Enter the following commands in your script and run them.

`g1 <- glm(I(prog=="academic")~gender+race+ses+schtyp+ read+write+science+socst, family=binomial(), data=hsb) summary(g1)`

The results of the above commands are shown below.

`Call: glm(formula = I(prog == "academic") ~ gender + race + ses + schtyp + read + write + science + socst, family = binomial(), data = hsb) Deviance Residuals: Min 1Q Median 3Q Max -2.1823 -0.8645 0.3223 0.8676 2.0749 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.94995 1.49602 -3.309 0.000937 *** gendermale 0.23634 0.37889 0.624 0.532779 raceasian -0.11219 0.92526 -0.121 0.903490 racehispanic 0.52553 0.71902 0.731 0.464842 racewhite -0.14508 0.61364 -0.236 0.813101 seslow -0.69719 0.51813 -1.346 0.178437 sesmiddle -0.90940 0.41622 -2.185 0.028896 * schtyppublic -1.16715 0.49695 -2.349 0.018843 * read 0.06760 0.02596 2.604 0.009208 ** write 0.05920 0.02801 2.114 0.034540 * science -0.05286 0.02612 -2.023 0.043029 * socst 0.05148 0.02259 2.279 0.022648 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 276.76 on 199 degrees of freedom Residual deviance: 211.26 on 188 degrees of freedom AIC: 235.26 Number of Fisher Scoring iterations: 4`

The summary output for a GLM models displays the call, residuals, and coefficients similar to an LM object. The model information at the bottom of the output is different. For a GLM model the dispersion parameter and deviance values are provided. Here we have a set dispersion value of 1, since we are not working with a quasi family.

We will not check the model fit with a test of the residual deviance, since the distribution is not expected to be \(\chi^2_{df}\) distributed.

We will use step with the criteria being the likelihood ratio test (LRT) to reduce unneeded variables from the model.

Enter the following command in your script and run it.

`step(g1, test="LRT")`

The results of the above commands are shown below.

`step(g1, test="LRT")`

`Start: AIC=235.26 I(prog == "academic") ~ gender + race + ses + schtyp + read + write + science + socst Df Deviance AIC LRT Pr(>Chi) - race 3 212.64 230.64 1.3789 0.710487 - gender 1 211.65 233.65 0.3906 0.531964 <none> 211.26 235.26 - ses 2 216.23 236.23 4.9660 0.083492 . - science 1 215.52 237.52 4.2646 0.038916 * - write 1 215.90 237.90 4.6386 0.031260 * - socst 1 216.66 238.66 5.4048 0.020082 * - schtyp 1 217.29 239.29 6.0261 0.014096 * - read 1 218.41 240.41 7.1487 0.007502 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Step: AIC=230.64 I(prog == "academic") ~ gender + ses + schtyp + read + write + science + socst Df Deviance AIC LRT Pr(>Chi) - gender 1 213.18 229.18 0.5399 0.46246 <none> 212.64 230.64 - ses 2 217.43 231.43 4.7867 0.09132 . - write 1 217.06 233.06 4.4201 0.03552 * - schtyp 1 218.19 234.19 5.5518 0.01846 * - science 1 218.20 234.20 5.5653 0.01832 * - socst 1 218.33 234.33 5.6894 0.01707 * - read 1 219.39 235.39 6.7547 0.00935 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Step: AIC=229.18 I(prog == "academic") ~ ses + schtyp + read + write + science + socst Df Deviance AIC LRT Pr(>Chi) <none> 213.18 229.18 - ses 2 218.13 230.13 4.9494 0.084189 . - write 1 217.06 231.06 3.8857 0.048700 * - science 1 218.25 232.25 5.0752 0.024271 * - schtyp 1 218.78 232.78 5.6008 0.017953 * - socst 1 218.85 232.85 5.6742 0.017216 * - read 1 220.55 234.55 7.3690 0.006636 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

`Call: glm(formula = I(prog == "academic") ~ ses + schtyp + read + write + science + socst, family = binomial(), data = hsb) Coefficients: (Intercept) seslow sesmiddle schtyppublic read -4.38958 -0.67766 -0.89902 -1.10394 0.06794 write science socst 0.04845 -0.05345 0.05174 Degrees of Freedom: 199 Total (i.e. Null); 192 Residual Null Deviance: 276.8 Residual Deviance: 213.2 AIC: 229.2`

We can now fit the model suggested by step.

Enter the following commands in your script and run them.

`g2 <- glm(I(prog == "academic") ~ ses + schtyp + read + write + science + socst, family = "binomial", data = hsb) summary(g2)`

The results of the above commands are shown below.

`Call: glm(formula = I(prog == "academic") ~ ses + schtyp + read + write + science + socst, family = "binomial", data = hsb) Deviance Residuals: Min 1Q Median 3Q Max -2.1132 -0.8912 0.3224 0.8909 2.0947 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.38958 1.37564 -3.191 0.00142 ** seslow -0.67766 0.49931 -1.357 0.17472 sesmiddle -0.89902 0.41186 -2.183 0.02905 * schtyppublic -1.10394 0.48643 -2.269 0.02324 * read 0.06794 0.02570 2.644 0.00820 ** write 0.04845 0.02494 1.942 0.05210 . science -0.05345 0.02434 -2.196 0.02810 * socst 0.05174 0.02214 2.337 0.01946 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 276.76 on 199 degrees of freedom Residual deviance: 213.18 on 192 degrees of freedom AIC: 229.18 Number of Fisher Scoring iterations: 4`

Note there are differences between the p-values reported in summary and what was reported the the LRT test in the final step of the step() function above.

The quasibinomial family is useful for modelling response variables with a bounded range.

Residual plots provide little assistance in evaluating binary models. The diagnostics for the sensitivity of the model to the data are checked checked using the same methods as was done for OLS models.

Count data is typically modelled using the poisson family. This uses a log link function and a variance function of \(\mu\). The modelled response is the predicted log count.

We will use the discoveries dataset from the datasets package for our binary response model. This dataset is the number of great inventions for the years 1860 to 1959. We will start by loading the data.frame and adding a variable to represent the number of years since 1860. We will assume that there is no correlation between the years to focus on the GLM model fit.

Enter the following commands in your script and run them.

`data(discoveries) disc <- data.frame(count=as.numeric(discoveries), year=seq(0,(length(discoveries)-1),1) )`

There are no console results from the above commands.

We will fit the count of inventions with year and year squared.

Enter the following commands in your script and run them.

`yearSqr=disc$year^2 p1 <- glm(count~year+yearSqr, family="poisson", data=disc) summary(p1)`

The results of the above commands are shown below.

`Call: glm(formula = count ~ year + yearSqr, family = "poisson", data = disc) Deviance Residuals: Min 1Q Median 3Q Max -2.9066 -0.8397 -0.2544 0.4776 3.3303 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 7.592e-01 1.814e-01 4.186 2.84e-05 *** year 3.356e-02 8.499e-03 3.948 7.87e-05 *** yearSqr -4.106e-04 8.699e-05 -4.720 2.35e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 164.68 on 99 degrees of freedom Residual deviance: 132.84 on 97 degrees of freedom AIC: 407.85 Number of Fisher Scoring iterations: 5`

We can check the goodness of fit of this model. We will use the deviance of the residuals for this test.

Enter the following commands in your script and run them.

`1 - pchisq(deviance(p1),df.residual(p1))`

The results of the above commands are shown below.

`[1] 0.009204575`

The p-value is approximately .001. This deviance is not likely to have occurred by chance, under the null hypothesis of the deviances being \(\chi^2\). Therefore we have evidence of overdispersion. The presence of overdispersion suggested the use of the F-test for nested models. We will test if the squared term can be dropped from the model.

Enter the following command in your script and run it.

`drop1(p1, test="F")`

The results of the above command are shown below.

`Warning in drop1.glm(p1, test = "F"): F test assumes 'quasipoisson' family`

`Single term deletions Model: count ~ year + yearSqr Df Deviance AIC F value Pr(>F) <none> 132.84 407.85 year 1 149.66 422.67 12.286 0.0006927 *** yearSqr 1 157.32 430.32 17.874 5.345e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

The p-value for yearSqr is small (.0005.) We will retain the yearSqr term in the model.

The invention count model from above needs to be fit using the quasipoisson family. The quasipoisson family will account for the greater variance in the data.

Enter the following commands in your script and run them.

`p2 <- glm(count~year+yearSqr, family="quasipoisson", data=disc) summary(p2)`

The results of the above commands are shown below.

`Call: glm(formula = count ~ year + yearSqr, family = "quasipoisson", data = disc) Deviance Residuals: Min 1Q Median 3Q Max -2.9066 -0.8397 -0.2544 0.4776 3.3303 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.7592473 0.2072715 3.663 0.000406 *** year 0.0335569 0.0097112 3.455 0.000816 *** yearSqr -0.0004106 0.0000994 -4.131 7.66e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for quasipoisson family taken to be 1.305649) Null deviance: 164.68 on 99 degrees of freedom Residual deviance: 132.84 on 97 degrees of freedom AIC: NA Number of Fisher Scoring iterations: 5`

`[1] 1.305649`

There is no change in the estimated coefficient between the quasipoisson fit and the poisson fit. The significance of the terms does change and a dispersion parameter is estimated.

Before determining that the quasipoisson family is appropriate, we will check to see if the variance of the residuals is proportional to the mean. We begin this check by creating a new data.frame which includes the residuals and fitted values.

Enter the following command in your script and run it.

`p1Diag <- data.frame(disc, link=predict(p1, type="link"), fit=predict(p1, type="response"), pearson=residuals(p1,type="pearson"), resid=residuals(p1,type="response"), residSqr=residuals(p1,type="response")^2 )`

There are no console results for the above command.

We will plot the square of the residual to the predicted mean. We will add to this scatter plot a black line for the Poisson assumed variance, a green line for the quasi-Poisson assumed variance, and a blue curve for the smoothed mean of the square of the residual.

Enter the following command in your script and run it.

`ggplot(data=p1Diag, aes(x=fit, y=residSqr)) + geom_point() + geom_abline(intercept = 0, slope = 1) + geom_abline(intercept = 0, slope = summary(p2)$dispersion, color="green") + stat_smooth(method="loess", se = FALSE) + theme_bw()`

There are no console results from this command. The following plot is produced.

Ideally the blue curve would be straight and it would be collinear with the green line for the quasi-Poisson variance. The greater the deviation from the green line the greater the concern is about the proportionality of the variance to the mean. Here we have some indication that the variance may not be proportional to the mean.

We will now look to see if a negative binomial model might be a better fit. The negative binomial requires the use of the glm.nb() function. The call to glm.nb is similar to glm, except no family is given.

Enter the following commands in your script and run them.

`nb1 <- glm.nb(count~year+yearSqr, data=disc) summary(nb1)`

The results of the above commands are shown below.

`Call: glm.nb(formula = count ~ year + yearSqr, data = disc, init.theta = 11.53157519, link = log) Deviance Residuals: Min 1Q Median 3Q Max -2.6849 -0.7707 -0.2344 0.4232 2.7134 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 7.526e-01 2.019e-01 3.727 0.000194 *** year 3.386e-02 9.471e-03 3.575 0.000350 *** yearSqr -4.132e-04 9.627e-05 -4.292 1.77e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for Negative Binomial(11.5316) family taken to be 1) Null deviance: 131.91 on 99 degrees of freedom Residual deviance: 106.20 on 97 degrees of freedom AIC: 406.21 Number of Fisher Scoring iterations: 1 Theta: 11.53 Std. Err.: 7.42 2 x log-likelihood: -398.214`

The coefficients have only a small change from the quasi-Poisson model.

The drop1 function is used to test the significance of the squared term for year. We use the likelihood ratio test for negative binomial models.

Enter the following command in your script and run it.

`drop1(nb1, test="LRT")`

The results of the above command are shown below.

`Single term deletions Model: count ~ year + yearSqr Df Deviance AIC LRT Pr(>Chi) <none> 106.20 404.21 year 1 119.54 415.55 13.338 0.00026 *** yearSqr 1 125.79 421.80 19.589 9.602e-06 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

The squared term is significant and is retained in the model.

We will repeat the check of the variance of the residuals which was done for the quasi-Poisson model. Plotting the square of the residual to the fitted values, with a black line for Poisson, green line for quasi-Poisson, a blue curve for smoothed mean of the square of the residual, and a red curve for predicted variance from the negative binomial fit.

Enter the following command in your script and run it.

`nb1Diag <- data.frame(disc, link=predict(nb1, type="link"), fit=predict(nb1, type="response"), pearson=residuals(nb1,type="pearson"), resid=residuals(nb1,type="response"), residSqr=residuals(nb1,type="response")^2 ) ggplot(data=nb1Diag, aes(x=fit, y=residSqr)) + geom_point() + geom_abline(intercept = 0, slope = 1) + geom_abline(intercept = 0, slope = summary(p2)$dispersion, color="green") + stat_function(fun=function(fit){fit + fit^2/11.53}, color="red") + stat_smooth(method="loess", se = FALSE) + theme_bw()`

There are no console results from this command. The following plot is produced.

The negative binomial variance curve (red) is close to the quasi-Poisson line (green.)

Although the means and variance predictions for the negative binomial and quasi-Poisson models are similar, the probability for any given integer is different for the two models. The following code shows the predicted probabilities of 0 through 7 when the mean is predicted to be 4.

Enter the following command in your script and run it.

`data.frame( number=0:8, prob_Poisson=round( dpois(0:8,(4*summary(p2)$dispersion) ), 3 ), prob_NBinom=round( dnbinom(0:8,mu=4,size=summary(nb1)$theta), 3 ) ) summary(nb1)$theta`

The results of the above command are shown below.

`number prob_Poisson prob_NBinom 1 0 0.005 0.032 2 1 0.028 0.096 3 2 0.074 0.155 4 3 0.128 0.180 5 4 0.167 0.168 6 5 0.175 0.134 7 6 0.152 0.095 8 7 0.113 0.062 9 8 0.074 0.037`

`[1] 11.53158`

The interpretation of the two models is different as well as the probabilities of the event counts. Examining the diagnostics would be useful step in choosing between these two models.

Residual plots of the pearsons residuals to the link function have some utility for count data. The diagnostics for the sensitivity of the model to the data are checked using the same methods as was done for OLS models.

Use the following code to load the warpbreaks data set and examine the variables in the data set.

`data(warpbreaks) str(warpbreaks) plot(warpbreaks)`

Use the poisson family and fit breaks with wool, tension, and their interaction.

Check to see if this is an appropriate model. If not, choose a more appropriate model form.

Use the backward selection method to reduce your model, if possible. Use your model from the prior problem as the starting model.

Check the residual variance assumption for your model.

Next: Regression Inference

Previous: Regression Diagnostics

Last Revised: 3/19/2014