This article is part of the Mixed Model series. For a list of topics covered by this series, see the Introduction article. If you're new to mixed models we highly recommend reading the articles in order.

Models with random effects do not have classic asymptotic theory which one can appeal to for inference. There currently is debate among good statisticians as to what statistical tools are appropriate to evaluate these models and to use for inference. This article presents some of the more useful tools currently available while noting the limitations of these tools.

Mixed model parameters do not have nice asymptotic distributions to test against. This is in contrast to OLS parameters, and to some extent GLM parameters, which asymptotically converge to known distributions. This complicates the inferences which can be made from mixed models.

One source of the complexity is a penalty factor (shrinkage) which is applied to the random effects in the calculation of the likelihood (or restricted likelihood) function the model is optimized to. This results in distributions which are no longer chi squared or F. This penalty factor also complicates determining the degrees of freedom to associate with the estimate of a random effect. For example a variance parameter, say r1, maybe estimated from twenty levels in a model. The design matrix used to estimate the model parameters uses twenty indicator variables for these twenty levels. There is only one parameter for these twenty indicators in the model. We would typically associate one degree of freedom with one estimated value. The design matrix used includes the twenty indicator variables and we would normally associate twenty degrees of freedom with these twenty indicators. Since these twenty indicators have a shrinkage factor applied to them, we do not really need twenty degrees of freedom. So what would be the correct degrees of freedom to use for the cost to estimate this one variance parameter? One? twenty? Something in between? Unfortunately there is no generally accepted theory which provides an answer to this question. Assuming we can find a good value for the degrees of freedom, we still can not count on our test statistic (from likelihood ratio tests and the like) to be F or chi distributed with the penalty applied to the model.

Another source of complications is testing the significance of a variance parameter, \(\boldsymbol{\theta}\). Since the variance must be greater than or equal to zero, a test of zero is on the border of the parameter space. Tests of parameters are valid only on the interior of their space and not on the border.

The correlation structure within the data complicates using bootstrap procedures to test these statistics which do not have known distributions. Parametric bootstraps which can more easily account for the correlation in the model are more typically used for inference in mixed models than bootstraps, which are non-parametric.

This article does not discuss the details of these issues. These issues are raised to let you to know about the limitations of inference with mixed models and what can be done. Although the asymptotics of mixed models are not as classically clean as in OLS and GLM models, inference can still be useful to guide decisions. Recall that all real world (finite non-asymptotic) statistics are estimates and one of the goals of statistics is to quantify the uncertainty of estimates. With this understanding, this article will demonstrate some of the inference tools available in R to quantify the uncertainty in the estimates from lme4 mixed models. You will need to consider the assumptions and limitations of these tools in your work.

Tests of fixed effects are typically done with either Wald or likelihood ratio (LRT) tests. With the assumptions of asymptotic distributions and independent predictors, Wald and LRT tests are equivalent. When a data set size is not large enough to be a good approximation of the asymptotic distribution or there is some correlation amongst the predictors, the Wald and LRT test results can vary considerably.

How large the data set needs to be for the asymptotic distribution to be a good approximation depends not only on how many observations you have, but also on the response variable type and the size of subgroups of observations formed by the categorical variables in the model. With a continuous response variable in a linear mixed model, subgroup sizes as small as five may be enough for the Wald and LRT to be similar. When the response is an indicator variable and the proportion of events of interest is small, groups size of one hundred may not be large enough for the Wald and LRT results to be similar.

The Wald test is based only on estimates from the model being evaluated. This results in an implied assumption that a model which holds the parameter being tested to zero will be the same with the exception of the parameter which is being tested. Correlation between the tested predictor and the other model predictors, can cause the estimate made from the model including the parameter to be different from a model which holds the parameter to zero. The LRT requires the formal estimation of a model which restricts the parameter to zero and therefore accounts for correlation in its test.

The LRT is generally preferred over Wald tests of fixed effects in mixed models. For linear mixed models with little correlation among predictors, a Wald test using the approach of Kenward and Rogers (1997) will be quite similar to LRT test results. The SSCC does not recommend the use of Wald tests for generalized models.

The most reliable inferences for mixed models are done with Markov Chain Monte Carlo (MCMC) and parametric bootstrap tests. Both of these are computationally expensive and require longer run times. The parametric bootstrap is more intuitive and easier to generally apply. This article will demonstrate the use of parametric bootstrap in some examples. MCMC will not be discussed in this article.

Some common test available for lme4 models are listed below. The tests are listed from least efficient to most efficient.

Wald test. Tests of the effect size which is scaled using the estimated standard error.

LRT (Likelihood Ratio Test.) Tests the difference in two nested models using the Chi square distribution.

KRmodComp. Only available for linear mixed models (does not support glmer() models.) An F test of nested models with an estimated degrees of freedom. The KRmodcomp() function estimates which F-test distribution is the best distribution from the family of F distributions. This function addresses the degrees of freedom concern.

Profiled confidence interval. While not a formal test, an interval which does not contain zero indicates the parameter is significant.

Parametric bootstrap. Assumes the model which restricts a parameter to zero (null model) is the true distribution and generates an empirical distribution of the difference in the two models. The observed coefficient is tested against the generated empirical distribution.

Since the distributions of coefficients are only approximately asymptotical, two or more of the above are generally done to confirm results of tests that are inconclusive.

A Walt test is done by comparing the coefficient's estimated value with the estimated standard error for the coefficient. This is provided by the summary of glmer() model where the coefficient's estimate is expected to be normally distributed (z-test.) The Wald test is not provided with the summary of lmer() models.

The Likelihood Ratio Test (LRT) of fixed effects requires the models be fit with by MLE (use REML=FALSE for linear mixed models.) The LRT of mixed models is only approximately \(\chi^2\) distributed. For tests of fixed effects the p-values will be smaller. Thus if a p-value is greater than the cutoff value, you can be confident that a more accurate test would also retain the null hypothesis. For p-values that are only a little below the cutoff value, a more accurate approach would need to be used.

There are several R functions which can be used for the LRT. Two of these, drop1() and anova(),are used here to test if the x1 coefficient is zero.

The LRT using drop() requires the test parameter be set to "Chisq". Enter the following command in your script and run it.

`drop1(gmm,test="Chisq")`

The results of the above command are shown below.

`Single term deletions Model: bin ~ x1 + x2 + (1 | g1) Df AIC LRT Pr(Chi) <none> 112.97 x1 1 114.08 3.1009 0.07825 . x2 1 135.56 24.5834 7.116e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

To use anova for the LRT, the nested model (without x1) needs to be generated. Enter the following commands in your script and run them.

`# fit nested model and LRT uing anova gmmDx1 <- glmer(bin~ x2 + (1|g1), family=binomial, data=pbDat) anova(gmm,gmmDx1,test="Chisq")`

The results of the above commands are shown below.

`Data: pbDat Models: gmmDx1: bin ~ x2 + (1 | g1) gmm: bin ~ x1 + x2 + (1 | g1) Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq) gmmDx1 3 114.08 121.89 -54.038 108.08 gmm 4 112.97 123.40 -52.488 104.97 3.1009 1 0.07825 . --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

The p-value from both test are the same, 0.0783. This is larger than the common cut off alpha level of .05. Since this is the lowest we would expect the p-value to be, we have determined that the coefficient is not significant. If the p-value was inconclusive, a parametric bootstrap could be used to provide a better estimated p-value.

The p-value from the Wald test in the summary of the gmm model is 0.0939. This is about 20% larger than the 0.0783, p-value from the LRT.

The KRmodComp() function does not support generalized models. To demonstrate this function, we will create a lmer() model using the continuous y response in the pbDat data set.

Enter the following commands in your script and run them.

`mm <- lmer(y~x1 + x2 + (1|g1), data=pbDat) mmMLE <- lmer(y~x1 + x2 + (1|g1), REML=FALSE, data=pbDat) mmDx1 <- lmer(y~ x2 + (1|g1), REML=FALSE, data=pbDat) KRmodcomp(mmMLE,mmDx1) anova(mmMLE,mmDx1,test="Chisq")`

The results of the above commands are shown below.

`F-test with Kenward-Roger approximation; computing time: 0.13 sec. large : y ~ x1 + x2 + (1 | g1) small : y ~ x2 + (1 | g1) stat ndf ddf F.scaling p.value Ftest 4.4925 1.0000 87.5747 1 0.03687 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

`Data: pbDat Models: mmDx1: y ~ x2 + (1 | g1) mmMLE: y ~ x1 + x2 + (1 | g1) Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq) mmDx1 4 1215.8 1226.2 -603.91 1207.8 mmMLE 5 1213.4 1226.4 -601.69 1203.4 4.4586 1 0.03473 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1`

These models are not comparable to the glmer() example models. These were constructed only to demonstrate the KRmodcomp() function. The KRmodcomp p-value and the LRT p-value are similar.

The profiled confidence interval, while not a formal test, provides another method of evaluating the significance of a coefficient. The profiled confidence interval of a parameter is constructed by holding all other model parameters constant and then examining the likelihood of the the single parameter individually. If the parameter being profiled is correlated with other parameters, the profiled confidence interval assumption of the other parameters being held constant could affect the estimated interval.

Enter the following command in your script and run it.

`confint(gmm)`

The results of the above command are shown below.

`Computing profile confidence intervals ...`

`2.5 % 97.5 % .sig01 1.06869440 3.9476664 (Intercept) -1.33574123 1.6490301 x1 -0.05428904 1.1343119 x2 -2.39358637 -0.8247741`

The confidence interval (95%) for x1 includes 0. This provides additional evidence that the p-value is greater than .05.

A bootstrap test would not be needed for our analysis of the x1 coefficient. The p-value we received from the LRT and confirmed with the profiled confidence interval provide good evidence that x1 is not significantly different from zero in the gmm model. A bootstrap would be used if a more accurate p-value is needed for reporting or if the LRT is inconclusive (a little smaller than the cut off alpha level.) To get a more accurate p-value, one would need to do many more simulation runs than the 1000 that is used in this example.

Enter the following commands in your script and run them.

`pbnmx1 <- pbnm(gmm,gmmDx1,nsim=1000,tasks=10,cores=2,seed=31221743) summary(pbnmx1)`

The results of the above commands are shown below.

`Parametric bootstrap testing: x1 = 0 from: glmer(formula = bin ~ x1 + x2 + (1 | g1), data = pbDat, family = binomial) 1000 samples were taken Thu Aug 18 14:28:18 2016 76 samples had warnings, 51 in alternate model 58 in null model 76 unused samples. 0.085 <= P(abs(x1) > |0.4995956|) <= 0.161`

The p-value from this test is given as a range due to 76 of the simulated null responses resulting in models which were not able to produce final results. Bootstrap tests which produce more than a a few tenths of a percent of models with issues (displayed by pbnm as errors or warning) may suggest that there is not enough data to support the model being fit. Further diagnostics are warranted in these situations. The proportion of problematic runs is high enough in this bootstrap to warrant further investigation of this model.

The range of possible p-values reported from the parametric boot strap is reasonably consistent with the 0.0783, p-value from the LRT test of the coefficient of x1 equaling zero.

The PBmodcomp() function in the pbkrtest package is another function which will do a parametric bootstrap test when the model is linear.

All tests of coefficients have the same accuracy constraints related to the efficiency of the test being done. See the introductory paragraphs of the Test of fixed effects section for a review of these issues.

Wald based tests of coefficients can be done using the linearHypothesis() function. The tests are specified as equations using the names of the coefficients from the model summary. Multiple tests can be done simultaneously. This allows for a single p-value for joint tests from a model.

We will use linearHypothesis() to test if x2 is ten times the negation of x1 (x2 = -10*x1). All coefficients need to be on the left hand side of the equation for the linearHypothesis() function. The test can be rewritten as x2 + 10*x1 = 0. As before we will use the MLE fit model for the LRT test of the restricted model

Enter the following command in your script and run it.

`(LH <- linearHypothesis(gmm,c("10*x1+x2=0") ) )`

The results of the above command are shown below.

`Linear hypothesis test Hypothesis: 10 x1 + x2 = 0 Model 1: restricted model Model 2: bin ~ x1 + x2 + (1 | g1) Df Chisq Pr(>Chisq) 1 2 1 1.4086 0.2353`

The p-value is well above .05.

Linear hypothesis tests can also be done with the KRmodcomp() function, if your model is a linear mixed model. This will provide a more efficient test of the hypothesis than the linearHypothesis() function. The model from our example is a generalized mixed model. We will use the model from the KRmodcomp section above to provide an example of the KRModcomp() function. We will test the same hypothesis.

Enter the following commands in your script and run them.

`L <- cbind(0, 10, 1) (KR2 <- KRmodcomp(mmMLE,L))`

The results of the above commands are shown below.

`F-test with Kenward-Roger approximation; computing time: 0.05 sec. large : y ~ x1 + x2 + (1 | g1) small : L beta = L betaH L= 1 x 3 sparse Matrix of class "dgCMatrix" [1,] . 10 1 betaH= [1] 0 stat ndf ddf F.scaling p.value Ftest 2.0402 1.0000 87.6484 1 0.1567`

This p-value is also greater than .05.

The preferred testing approaches using the LRT or parametric bootstrap require that a contrast be created. The contrast for the above test results in a new variable which is equal to x1 - 10 * x2. The following example shows the code needed to create the contrast variable and do a LRT.

Enter the following commands in your script and run them.

`x1p10x2 <- pbDat$x1 - pbDat$x2*10 gmmCont <- glmer(bin~ x1p10x2 + (1|g1), family=binomial, data=pbDat) anova(gmm,gmmCont,test="Chisq")`

The results of the above commands are shown below.

`Data: pbDat Models: gmmCont: bin ~ x1p10x2 + (1 | g1) gmm: bin ~ x1 + x2 + (1 | g1) Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq) gmmCont 3 112.46 120.27 -53.228 106.46 gmm 4 112.97 123.40 -52.488 104.97 1.4813 1 0.2236`

The p-value from the LRT, 0.2236, is fairly close to the Wald test p-value, 0.2353. A parametric bootstrap could also be done to get a more accurate p-value if needed..

The test which are in common use for the variance parameter are listed from least efficient to most efficient.

LRT. See the discussion in the LRT section above for its description and limitations.

Profiled confidence interval. See the discussion in the Profiled confidence interval section above for its description and limitations.

Test based on Crainiceanu, C. and Ruppert, D. ( 2004 ) for linear mixed models. There is no equivalent for generalized mixed models. The variance parameters of the model must be uncorrelated. There is no equivalent function for when the random variables are correlated, such as with hierarchical models. In R, the exactRLRT() function is an implementation of Crainiceanu and Ruppert.

Parametric bootstrap. See the discussion in the parametric bootstrap section above for its description and limitations.

There is no Wald test of a variance parameter since there is no estimate for the standard error of the variance estimate.

The variance parameter of a generalized mixed models does not have a known asymptotic distribution. The LRT for these variance parameters at times can be poor estimates. We recommend treating these p-values with caution. The LRT test of a variance parameter equalling zero will be conservative (larger p-value). This is due to the test being on a boundary condition (\(\sigma^2 \ge 0\)). Thus if the p-value is small enough to be significant with the LRT test, your finding is likely good. There are some areas were twice the LRT p-value is used as a formal test. We do not recommend this for variance of generalized mixed models since the p-value can be a poor estimate at times.

It the variance parameter being tested is the only variance parameter in the model, the null model will be a fixed effects model. One needs to be sure that the functions used to calculate the likelihoods for the two models use the same constants terms. This is not always the case with R functions.

The gmm model has only one random effect variable, g1. We will test if the random variable g2 is needed using the LRT.

Enter the following commands in your script and run them.

`gmmG2 <- glmer(bin ~ x1 + x2 + (1|g1) + (1|g2), family=binomial, data=pbDat) # LRT calculated using the loglik() function # G2 = -2 * logLik(gmm) + 2 * logLik(gmmG2) pchisq(as.numeric(G2), df=1, lower.tail=F)`

The results of the above commands are shown below.

`[1] 1`

The p-value of 1 is much larger than .05, so the variance associated with the g2 groups is likely to have occurred by chance. To confirm this result, a parametric bootstrap would be used. In the fixed effect section above we saw that our example model generates too many warnings to get an exact p-value from a parametric boot strap (likely due to not having enough data to support the model.)

The exactRLRT() function does not support generalized models as we have in our example model. To demonstrate these functions, we will use the linear version of our example model, mm. The following code tests if the variance for the random effect g1 is zero.

Enter the following command in your script and run it.

`exactRLRT(mm)`

The results of the above command are shown below.

`simulated finite sample distribution of RLRT. (p-value based on 10000 simulated values) data: RLRT = 46.218, p-value < 2.2e-16`

The p-value of 0 is evidence that the variance has a non zero value.

We can also check if the random effect g2 is needed. This is demonstrated with the following code. The exactRLRT() function requires three parameters when the variance which is being tested in not the only random effect in the model. The three parameters are the null model, the m0 parameter, and the alternative model, the mA parameter, and a model object with all of the fixed effects and just the single random effect which is being tested, the m parameter.

Enter the following command in your script and run it.

`mmG2 <- lmer(y ~ x1 + x2 + (1|g1) + (1|g2), data=pbDat) mmR <- lmer(y ~ x1 + x2 + (1|g2), data=pbDat) exactRLRT(m=mmR,mA=mmG2,m0=mm)`

The results of the above command are shown below.

`simulated finite sample distribution of RLRT. (p-value based on 10000 simulated values) data: RLRT = 2.2737e-13, p-value = 0.4273`

The p-value of 0.4273 does not provide evidence that the variance is different from zero and we would retain the null hypothesis assumption of a zero value.

The profiled confidence interval for all of the model parameters was demonstrated in the Test if coefficients are zero section above. See this section for the confidence interval associated with the variance of g1.

To use pbnm() to test if the variance of g1 is greater than zero, we need a null model with the variance equal to zero. The following code constructs a model with the variance of g1 equal to zero and does the parametric bootstrap test.

Enter the following commands in your script and run them.

`gmmDG1 <- glm(bin ~ x1 + x2, family=binomial, data=pbDat) pbgmmDg1 <- pbnm(gmm,gmmDG1,nsim=1000,tasks=10,cores=2,seed=3400835) summary(pbgmmDg1)`

The results of the above commands are shown below.

`Parametric bootstrap testing: (Intercept) | g1 = 0 from: glmer(formula = bin ~ x1 + x2 + (1 | g1), data = pbDat, family = binomial) 1000 samples were taken Thu Aug 18 14:29:04 2016 1 samples had warnings, 1 in alternate model 0 in null model 1 unused samples. 0 <= P(abs((Intercept) | g1) > |4.255377|) <= 0.001`

The p-value of less than or equal to 0.001 provides evidence that the variance of g1 is non zero.

The following code confirms that the results from the exactRLRT() function and the pbnm() function are the same.

Enter the following commands in your script and run them.

`fm <- lm(y ~ x1 + x2, data=pbDat) pbfm <- pbnm(mmMLE,fm,nsim=1000,tasks=10,cores=2,seed=98731307) summary(pbfm) pbmmG2 <- pbnm(mmG2,mm,nsim=1000,tasks=10,cores=2,seed=4642782) summary(pbmmG2)`

The results of the above commands are shown below.

`Parametric bootstrap testing: (Intercept) | g1 = 0 from: lmer(formula = y ~ x1 + x2 + (1 | g1), data = pbDat, REML = FALSE) 1000 samples were taken Thu Aug 18 14:29:17 2016 P(abs((Intercept) | g1) > |8837.818|) = 0`

`Parametric bootstrap testing: (Intercept) | g2 = 0 from: lmer(formula = y ~ x1 + x2 + (1 | g1) + (1 | g2), data = pbDat) 1000 samples were taken Thu Aug 18 14:29:48 2016 P(abs((Intercept) | g2) > |3.83661e-11|) = 0.477`

These p-values are with in round errors of the p-values returned from exactRLRT() calls. It is quicker to use the exactRLRT() when you can. When exactRLRT() assumptions are not met, the pbnm() function can provide the p-value. The pbnm() function is slower. But it requires fewer assumptions and as such can be used for a greater variety of tests.

Tests of variance other than being equal to zero, would require a user constructed parametric bootstrap. This is beyond the intended scope of this article.

Use the models from sleepstudy data set for the following exercises.

Test the significance of the random parameter in your model from exercise 5 in the Models article.

Test the significance of the random parameter in your model from exercise 6 in the Models article.

Previous: Models

Next: Diagnostics and Inferences

Last Revised: 3/23/2016