Creating Imputation Models

*This is part three of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction.*

In theory, an imputation model estimates the joint distribution of all the variables it contains. MICE breaks this problem into a series of estimations that regress one variable on all the other variables in the model. (The downside is that a series of models of the distributions of individual variables does not necessarily add up to a consistent model of the joint distribution.)

The mi impute chained command does not require you to specify the model for each variable separately: you just list the variables to be imputed along with information about how they should be imputed, and mi impute chained will form the individual models automatically. However, the success of the overall imputation model depends on the success of all the individual models. If a single model fails to converge, the imputation process as a whole will fail. If a single model is misspecified, it may bias the results of your analysis model. **We strongly recommend that you run each of the individual models on its own, outside the context of mi impute chained, to test for convergence and misspecification.** We'll discuss the details of doing so in the next section. This section will focus on issues you must consider in creating your imputation models.

The first step in creating an imputation model is deciding which variables to impute. The imputation model should always include all the variables in the analysis model. This includes the dependent variable of your analysis model, though there is some debate about whether the imputed values of the dependent variable should be used. Even if you don't plan to use the imputed values of the dependent variable, its observed values provide information about the other variables. Likewise, observations that are missing the dependent variable still contain information about the other variables, and the imputation model should use that information as well.

Example: Imputing the Dependent Variable

The imputation model should include any other variables that provide information either about the true values of the missing data or about their probability of being missing. Avoid creating a "kitchen sink" model, however. Large numbers of variables, especially categorical variables, can lead to models that fail to converge. Use theory to guide you in choosing appropriate variables.

You can add variables to the imputation model that do not need to be (or shouldn't be) imputed by putting them at the end of the variable list following an equals sign.
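As a minimal sketch of this syntax, the following simulates a small data set and imputes two variables while using two fully observed variables as predictors only. All variable names, the seeds, and the number of imputations are assumptions chosen for illustration.

```stata
* Simulated data; every variable name here is hypothetical
clear
set seed 12345
set obs 500
generate female = runiform() < 0.5            // fully observed
generate age    = 20 + floor(40*runiform())   // fully observed
generate educ   = 12 + 0.05*age + rnormal()
generate income = 10 + 2*educ + 3*female + rnormal(0, 5)
replace educ    = . if runiform() < 0.15      // create missing values
replace income  = . if runiform() < 0.20

mi set wide
mi register imputed educ income
mi register regular female age

* female and age follow the equals sign: they serve as predictors
* in every individual model but are not imputed themselves
mi impute chained (regress) educ income = female age, add(5) rseed(2468)
```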

You can add variables to or remove variables from the imputation model for an individual variable or group of variables using the include() or omit() options. The include() option even allows you to add expressions such as (x^2) to a model, but they have to go inside an additional set of parentheses (e.g. include((x^2)) ). These options go with the imputation method for a variable or variables (e.g. (regress, include(x)) ) rather than at the end of the mi impute chained command.
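The sketch below illustrates where these options go, using simulated data and hypothetical variable names. It only demonstrates syntax: w is dropped from the model for x, and the expression (x^2) is added to the model for y. Whether adding such an expression is a good idea is a separate modeling question.

```stata
* Simulated data; every variable name here is hypothetical
clear
set seed 12345
set obs 500
generate z = rnormal()
generate w = rnormal()
generate x = 0.5*z + rnormal()
generate y = 1 + x + 0.5*w + rnormal()
replace x = . if runiform() < 0.15
replace y = . if runiform() < 0.15

mi set wide
mi register imputed x y
mi register regular z w

* omit(w) removes w from the model for x only;
* include((x^2)) adds x squared to the model for y only
mi impute chained (regress, omit(w)) x         ///
                  (regress, include((x^2))) y  ///
                  = z w, add(5) rseed(2468)
```

Adding the dryrun option to the command should display the individual models without imputing anything, which is a convenient way to check that the options did what you intended.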

Be cautious about adding expressions to imputation models: if y depends on some function of x, then x should depend on the inverse function of y and failing to model both can bias your results. See Non-Linear Terms for further discussion.

If you have data where units are observed over time, the best predictors of a missing value in one period are likely the values of that variable in the previous and subsequent periods. However, the imputation model can only take advantage of this information if the data set is in wide form (one observation per unit, not one observation per unit per time period). You can convert back to long form after imputing if needed.

To convert the data to wide form before imputing, use reshape. To convert back to long form after imputing, use mi reshape. This has the same syntax as reshape, but makes sure the imputations are handled properly. If you're not familiar with reshape, see the Hierarchical Data section of Stata for Researchers.
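The round trip might look like the following sketch, which assumes a simulated panel of 200 units observed in three periods. The variable names (id, year, y) and the choice of the mlong style are assumptions made for illustration.

```stata
* Simulated panel in long form; every variable name here is hypothetical
clear
set seed 12345
set obs 200
generate id = _n
generate u  = rnormal()                  // unit-level effect
expand 3
bysort id: generate year = _n
generate y = u + 0.5*year + rnormal(0, 0.5)
drop u
replace y = . if runiform() < 0.15

* Wide form: one observation per unit, with y1, y2 and y3 as variables
reshape wide y, i(id) j(year)

mi set mlong
mi register imputed y1 y2 y3

* Each period's value is now imputed from the other periods' values
mi impute chained (regress) y1 y2 y3, add(5) rseed(2468)

* Back to long form, keeping the imputations intact
mi reshape long y, i(id) j(year)
```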

The mi estimate: and svy: prefix commands can be used together (in that order) to run models on survey data that have been multiply imputed. However, svy: cannot be used with mi impute chained. You can apply weights (e.g. [pweight=weight]) but cannot correct for other elements of survey structure like strata or PSU. The current recommendation is to include survey structure variables like strata and PSU in the imputation models as sets of indicator variables (e.g. i.psu). This is an area of ongoing research.
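A sketch of this recommendation, assuming a simulated survey with three strata, two PSUs numbered within each stratum, and a sampling weight. The variable names strata, psu and weight are hypothetical, as is everything else about the data.

```stata
* Simulated survey data; every variable name here is hypothetical
clear
set seed 12345
set obs 600
generate strata = 1 + mod(_n, 3)                  // three strata
generate psu    = 1 + mod(floor((_n - 1)/3), 2)   // PSU 1 or 2 within stratum
generate weight = 0.5 + runiform()
generate x = 0.3*strata + rnormal()
generate y = 1 + x + 0.5*psu + rnormal()
replace x = . if runiform() < 0.15
replace y = . if runiform() < 0.15

mi set wide
mi register imputed x y
mi register regular strata psu weight

* Weights are applied directly; strata and PSU enter the
* imputation models as sets of indicator variables
mi impute chained (regress) x y = i.strata i.psu [pweight=weight], ///
    add(5) rseed(2468)
```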

When you test your individual imputation models, we suggest running them first with the svy: prefix and then without it but with weights applied and survey structure variables added to the model. If the two give very different results, try adding interactions between the survey structure variables or additional variables related to survey structure. If they continue to give very different results despite your best efforts, be wary about using multiple imputation.
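One way to make this comparison is sketched below for a single individual model, run on simulated (not yet imputed) data. The survey design, the variable names, and the stored estimate names svy_fit and wt_fit are all assumptions for illustration.

```stata
* Simulated survey data; every variable name here is hypothetical
clear
set seed 12345
set obs 600
generate strata = 1 + mod(_n, 3)
generate psu    = 1 + mod(floor((_n - 1)/3), 2)
generate weight = 0.5 + runiform()
generate x = 0.3*strata + rnormal()
generate y = 1 + x + 0.5*psu + rnormal()
replace x = . if runiform() < 0.15
replace y = . if runiform() < 0.15

* Version 1: the individual model for x with the full survey design
svyset psu [pweight=weight], strata(strata)
svy: regress x y
estimates store svy_fit

* Version 2: the same model without svy:, using weights and
* survey structure variables as covariates
regress x y i.strata i.psu [pweight=weight]
estimates store wt_fit

* Put the two sets of results side by side
estimates table svy_fit wt_fit, se
```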

There are nine methods available for imputing a variable: regress, pmm, truncreg, intreg, logit, ologit, mlogit, poisson and nbreg. In most cases you'll choose the same imputation method you'd choose if you were going to model the variable normally: regress for most continuous variables, logit for binary variables, mlogit for unordered categorical variables, etc.

Keep in mind that the standard regress implies a normal error term after controlling for the covariates. If you have a continuous variable that is not normal, regress may not give you a distribution of imputed values that matches the observed values very well.

An alternative is Predictive Mean Matching (PMM). PMM is an ad hoc technique with little theory behind it, but it seems to work quite well in practice. PMM starts out by regressing the variable to be imputed on the covariates and then drawing a set of coefficients from the results, taking into account both the estimated coefficients and the uncertainty about them. Those coefficients are used to calculate a predicted value for each missing value. However, rather than using the predicted value directly, PMM uses the predicted value for a given observation to identify those observations whose observed values of the variable are close to the predicted value, and chooses one of them randomly to supply the imputed value. If the observed values of a variable are not normal, PMM will usually produce a distribution of imputed values that matches the distribution of the observed values more closely than regression.

The knn() option controls how many observations are considered as matches (based on their observed values of the variable being close to the predicted value for the observation being imputed). Recent work by Morris, White and Royston indicates that larger numbers of observations should be used than was standard practice in the past. They suggest at least 10, and more if your data set is very large (tens of thousands of observations or more).
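A minimal sketch of PMM with ten candidate matches, using a simulated skewed variable. The variable names are hypothetical, and knn(10) simply follows the minimum suggested above.

```stata
* Simulated data; every variable name here is hypothetical
clear
set seed 12345
set obs 1000
generate z = rnormal()
generate x = exp(0.5*z + rnormal(0, 0.5))   // skewed, strictly positive
generate y = 1 + 0.5*x + z + rnormal()
replace x = . if runiform() < 0.20
replace y = . if runiform() < 0.15

mi set wide
mi register imputed x y
mi register regular z

* x is imputed by PMM, choosing among the 10 closest observed matches
mi impute chained (pmm, knn(10)) x (regress) y = z, add(5) rseed(2468)
```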

Because PMM draws its imputed values from the observed values, it has the property that the imputed values will never be outside the range of the observed values. This makes it very useful for bounded variables (discussed below). It can also be used for some non-continuous distributions. However, PMM is not appropriate if you have reason to believe the unobserved values are outside the range of the observed values.

Skewed variables may be made more normal by transformations such as taking the log. However, you should consider how this affects the relationships between variables. For example, if you have variables for "income" and "spending on entertainment" and you believe the relationship between the two is linear, replacing "income" with "log income" makes the imputation model for both variables misspecified.

Another common situation is bounded variables. For example, "hours worked" cannot go below zero, and percentages must be between zero and 100. Such variables can be imputed using truncreg. The ll() and ul() options contain the lower limit and upper limit for the variable, which can be either numbers or variables. You are not required to specify both (e.g. hours worked probably only needs ll(0), unless you're worried that the model might try to have someone work more than 168 hours per week). Unfortunately, in our experience it's not unusual for truncreg to have convergence problems in imputation models with many variables.
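As a sketch, the following imputes a simulated "hours worked" variable with a lower limit of zero. The variable names and data-generating process are assumptions for illustration, and convergence on this toy data says nothing about convergence on real data.

```stata
* Simulated data; every variable name here is hypothetical
clear
set seed 12345
set obs 1000
generate z = rnormal()
generate hours = 25 + 8*z + rnormal(0, 12)
drop if hours <= 0                          // hours cannot be negative
generate wage = 15 + 0.1*hours + 2*z + rnormal(0, 3)
replace hours = . if runiform() < 0.15
replace wage  = . if runiform() < 0.15

mi set wide
mi register imputed hours wage
mi register regular z

* ll(0) sets the lower limit; ul() is left out because no upper
* limit is needed here
mi impute chained (truncreg, ll(0)) hours (regress) wage = z, ///
    add(5) rseed(2468)
```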

PMM is a good alternative to truncreg because it naturally honors any bounds that exist in the observed data.

If your analysis model contains non-linear terms, most likely variables squared, then this must be taken into account when creating your imputation model. Suppose your analysis model regresses y on x and x^2. If you just impute y and x, creating x^2 later (either with mi passive or c.x#c.x), then the imputed values of y will only depend on x and the imputed values of x will depend linearly on y. When you run your analysis model, the coefficient on the squared term will be biased towards zero because for observations where either y or x is imputed, y really is unrelated to x^2. (Never forget that when you write your mi impute chained command you are building models, not just listing variables to impute.)

The best alternative appears to be what White, Royston and Wood call the "Just Another Variable" approach. Create new variables to store the non-linear terms (e.g. gen x2=x^2) and then impute them as if they were just another variable, unrelated to the linear terms. The imputed values of the non-linear terms won't have the proper relationship to the linear terms (i.e. the imputed values x2 will not in fact be x^2) but as long as they are distributed properly this does not appear to affect the results of the analysis model. This is an area of ongoing research.
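A sketch of the "Just Another Variable" approach on simulated data follows. The variable names and the true coefficients used to generate the data are assumptions for illustration.

```stata
* Simulated data; every variable name here is hypothetical
clear
set seed 12345
set obs 1000
generate x = rnormal()
generate y = 1 + x + 0.5*x^2 + rnormal()
replace x = . if runiform() < 0.20
replace y = . if runiform() < 0.15

* Create the squared term before imputing; it is missing wherever x is
generate x2 = x^2

mi set wide
mi register imputed x x2 y

* x2 is imputed as if it were just another variable, so its imputed
* values will generally not equal the square of the imputed x
mi impute chained (regress) x x2 y, add(5) rseed(2468)

* The analysis model uses the imputed x2 directly, not c.x#c.x
mi estimate: regress y x x2
```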

Interaction terms raise issues very similar to those raised by non-linear terms: if the interaction term isn't included in the imputation model, the coefficient on the interaction term will be biased towards zero in the analysis model. The "Just Another Variable" approach also works well for interaction terms: create variables storing the interaction effects (e.g. gen gx=g*x) and then impute them separately.

If, however, the interactions involve binary or categorical variables that represent groups, consider instead using the by() option to impute each group separately. This allows coefficients to vary between groups without the problem of imputed interaction terms not actually matching the variables being interacted.

For example, suppose you're regressing income on education, experience, and black (an indicator for "subject is black"), but think the returns to education vary by race and thus include black##c.education in the regression. The "Just Another Variable" approach would create a variable edblack=black*education and impute it, but it's possible for the model to impute a zero for black and a non-zero value for edblack. There's no indication this would cause problems in the analysis model, however.

An alternative would be to add the by(black) option to the imputation command, so that whites and blacks are imputed separately. This would allow you to use black##c.education in your analysis model without bias (and it would always correspond to the actual values of black and education). However, running two separate imputation models allows the returns to experience to vary by race in the imputation model, not just the returns to education. If you had strong theoretical reasons to believe that was not the case (which is unlikely), that would be a specification problem. A far more common problem is small sample size: make sure each of your by() groups is big enough for reasonable regressions.
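A sketch of the by() approach for this example, using simulated data. The data-generating process is an assumption for illustration, and black is fully observed here, as it must be to serve as a by() variable.

```stata
* Simulated data; the numbers below are made up for illustration
clear
set seed 12345
set obs 1000
generate black      = runiform() < 0.3      // fully observed
generate experience = floor(30*runiform())
generate education  = 12 + 0.05*experience - black + rnormal(0, 2)
generate income     = 20 + 2*education + 0.5*experience ///
                      - 0.5*black*education + rnormal(0, 5)
replace education = . if runiform() < 0.15
replace income    = . if runiform() < 0.15

mi set wide
mi register imputed education income
mi register regular black experience

* The imputation models are fit separately for black==0 and black==1
mi impute chained (regress) education income = experience, ///
    add(5) rseed(2468) by(black)

* The interaction is built from the actual values of black and education
mi estimate: regress income black##c.education experience
```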

Trying to use "Just Another Variable" for interactions between categorical variables and imputing them with logit is problematic. Use by() instead.

If you have a set of mutually exclusive indicator variables, use them to create a single categorical variable and then impute it using mlogit. For example, combine white, black, hispanic, other into race, or highSchool, someCollege, bachelors, advanced into education. You can recreate the indicator variables after imputing, either with mi passive or by simply using i.race or i.education in your models.

If you impute the indicator variables themselves using logit, the imputation model will not impose the constraint that only one of them can be one. Thus you'll likely get people with more than one race or more than one education level. By converting the indicators to a categorical variable and imputing the categorical variable using mlogit you force the model to choose just one category.
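The steps above can be sketched as follows, using simulated data. The category shares, the other variable names, and the coding of race (1 = white through 4 = other) are assumptions for illustration.

```stata
* Simulated data; the numbers below are made up for illustration
clear
set seed 12345
set obs 800
generate age      = 20 + floor(40*runiform())
generate u        = runiform()
generate white    = u < 0.55
generate black    = u >= 0.55 & u < 0.75
generate hispanic = u >= 0.75 & u < 0.90
generate other    = u >= 0.90
generate income   = 30 + 0.3*age - 4*black - 3*hispanic + rnormal(0, 5)

* Make race missing for some people (all four indicators at once)
generate miss = runiform() < 0.15
foreach v of varlist white black hispanic other {
    replace `v' = . if miss
}
replace income = . if runiform() < 0.15

* Combine the indicators into one categorical variable, then drop them
generate race = 1*white + 2*black + 3*hispanic + 4*other
drop white black hispanic other miss u

mi set wide
mi register imputed race income
mi register regular age

* mlogit imputes exactly one category per person
mi impute chained (mlogit) race (regress) income = age, add(5) rseed(2468)

* Recreate an indicator with mi passive, or just use i.race in the model
mi passive: generate black = (race == 2) if !missing(race)
mi estimate: regress income i.race age
```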

Next: Imputing

Previous: Deciding to Impute

Last Revised: 9/15/2015