Multiple Imputation using ICE in Stata

Imputation using Chained Equations, or ICE, is by far the most popular method of multiple imputation at the SSCC. Historically this has been done in Stata using Patrick Royston's ice program. Stata 11 introduced an official framework for working with multiply imputed data, the mi commands, but because it didn't include ICE as an imputation method most SSCC members continued to use ice instead. Now, however, Stata 12 has an official command for carrying out ICE (mi impute chained) giving SSCC users both the flexibility of ICE and the efficiency of mi.

This article will introduce you to the basics of mi and mi impute chained, with the goal of allowing you to get preliminary results as quickly as possible. If you are already familiar with ice, it may be enough to allow you to make the switch to mi. However, it is not a replacement for the official Stata mi documentation (which is excellent) or a thorough study of the literature on ICE and multiple imputation. A good starting point is "Multiple imputation using chained equations: Issues and guidance for practice" by White, Royston and Wood (yes, the Royston who wrote ice) in the November 2010 issue of Statistics in Medicine, though you should keep in mind that it was written before Stata 12 made mi impute chained available.

Should I use Multiple Imputation and ICE?

ICE is a controversial technique. The Stata 12 documentation notes "A concern with ICE is its lack of a formal theoretical justification" but also "Despite the lack of a general theoretical justification, ICE is very popular in practice. Its popularity is mainly due to the tremendous flexibility it offers for imputing various types of data arising in observational studies." In the social sciences, where it's normal for binary variables, categorical variables, and continuous variables to coexist in a single data set, the choice is usually between multiple imputation using ICE and not using multiple imputation at all.

We are concerned, however, that some SSCC researchers are using ICE without spending enough time learning about the technique. Thus they may not fully appreciate its strengths and weaknesses, the conditions the data must meet for it to be valid, or the considerations involved in constructing imputation models. The use of multiple imputation naturally leads to two questions:

  1. If multiple imputation changes your results, can you be sure that this is not the result of a problem with the imputation process?
  2. If multiple imputation doesn't change your results, what's the point of using it?

Answering question one clearly requires a solid understanding of the theory of multiple imputation.

The answer to question two is often "My reviewer/advisor/committee member said I should try multiple imputation." In a situation like that, some quickly-generated preliminary results may be useful in deciding whether it's worthwhile to invest the time required to learn multiple imputation and ICE well enough to rely on them. This article will help you get those preliminary results.

Preparing to Impute

If possible, do all your variable selection, data cleaning, recoding, etc. before you do your imputation. The exception is variables that are derived from variables you intend to impute (what Stata calls passive variables), which must be calculated after imputing.

mi set

The first step in using mi is to mi set your data. This is somewhat similar to svyset, tsset or xtset.

The mi set command tells Stata how it should store the additional imputations. The best choices are usually wide or mlong.

To have Stata use the mlong data structure, type:

mi set mlong

To have Stata use the wide data structure, type:

mi set wide

In the mlong form, Stata will create a new copy of each observation for each imputation, but only for observations that actually have missing data (this makes it more efficient than ice, which uses a similar structure but copies all observations). In the wide form, Stata will create a new copy of each variable for each imputation, but only for those variables that have missing data. If your data set is big enough that memory usage is a factor, consider whether a higher proportion of your observations or of your variables have missing values, because that will determine which form adds more to your data set: if most observations have a missing value but only a few variables do, wide adds less, and vice versa. Once a structure is chosen, if you only change your data using mi commands, mi will take care of applying the changes to all the imputations and you can forget about which structure you used.

The wide vs. long terminology is borrowed from that used by reshape for working with hierarchical data, and the structures are similar. However, they are not equivalent and you would never use reshape to change the data structure used by mi. Instead, type mi convert wide or mi convert mlong.
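As a minimal sketch, assuming the data in memory have already been mi set, checking the current structure and switching to another might look like this. The mi query and mi describe commands only report on the data; they do not change it.

```stata
* Report the current mi style and the number of imputations
mi query

* Switch to the mlong style; the clear option tells Stata it is okay
* to convert even if the data in memory have unsaved changes
mi convert mlong, clear

* Summarize the mi settings and registered variables after converting
mi describe
```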

Registering Variables

The mi commands recognize three kinds of variables:

Imputed variables are variables that mi is to impute or has imputed.

Regular variables are variables that mi is not to impute, either by choice or because they are not missing any values.

Passive variables are variables that are completely determined by other variables. For example, log wage is determined by wage, or an indicator for obesity might be determined by weight and height. If you need to transform a variable to make it closer to normal, impute the transformed variable and make the original variable passive.

If a passive variable is determined by regular variables, then it can be treated as a regular variable since no imputation is needed. Passive variables only have to be treated as such if they depend on imputed variables.

Registering a variable tells Stata what kind of variable it is. Imputed variables must always be registered:

mi register imputed varlist

where varlist should be replaced by the actual list of variables to be imputed.

Regular variables often don't have to be registered, but it's a good idea:

mi register regular varlist

Passive variables must be registered:

mi register passive varlist

However, you'll often create passive variables after imputing. If you do so properly (i.e. with mi passive) then they'll be registered as passive automatically.
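The registration steps above can be sketched as follows, using the example data set introduced later in this article. Which variables to register as imputed is a judgment you must make for your own data; here the sketch simply assumes all five analysis variables have missing values.

```stata
* Load the example data used later in this article
use http://www.ssc.wisc.edu/sscc/pubs/files/stata_mi/midata.dta, clear

* See which variables have missing values before deciding what to impute
misstable summarize

* mi set the data, then register the variables to be imputed
mi set mlong
mi register imputed female race edu exp wage

* Confirm how each variable is registered
mi describe
```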

Imputing

The basic syntax for mi impute chained is:

mi impute chained (method1) varlist1 (method2) varlist2..., add(N)

where each method specifies the method to be used for imputing the following varlist and N is the number of imputations to be added to the data set. (Yes, that means you don't have to do all your imputations at the same time, a fact we take advantage of in Speeding up Multiple Imputation in Stata using Parallel Processing.)

The possibilities for method are regress, pmm (predictive mean matching), truncreg (truncated regression), intreg (interval regression), logit, ologit, mlogit, poisson, and nbreg. See the documentation (help mi impute chained) for far more details.
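Because add(N) adds to whatever imputations already exist, imputations can be created in batches. The following is a sketch only, assuming the example data set described below has already been mi set and its variables registered as imputed. The seed values are arbitrary; rseed() is included so that each batch can be reproduced.

```stata
* First batch of five imputations
mi impute chained (logit) female (mlogit) race (ologit) edu (regress) exp wage, add(5) rseed(1234)

* Later, add five more imputations to the same data set, using a different seed
mi impute chained (logit) female (mlogit) race (ologit) edu (regress) exp wage, add(5) rseed(5678)

* mi query should now report ten imputations
mi query
```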

If you want to use a variable in the imputation model but not impute it, place it at the end of the varlists following an equals sign. For example, the following will use x3 in the models for imputing x1 and x2, but not impute it:

mi impute chained (regress) x1 x2 = x3, add(10)

This only works, however, if x3 does not have missing values. If it does, you should impute it and then optionally not use those observations where x3 is missing in your final analysis. (See Not Imputing the Dependent Variable below.)

Analyzing Imputed Data

Analyzing imputed data is very simple: just put mi estimate: in front of whatever model you want to run:

mi estimate: regress y x

Stata will take care of running the model on each imputation and then combining the results using Rubin's Rules.
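For reference, Rubin's Rules for a single quantity are usually written as follows. The notation is an assumption of this sketch rather than something taken from the Stata output: M is the number of imputations, and \hat{Q}_m and U_m are the estimate and its estimated variance from imputation m.

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m, \qquad
W = \frac{1}{M}\sum_{m=1}^{M} U_m, \qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\left(\hat{Q}_m-\bar{Q}\right)^2, \qquad
T = W + \left(1+\frac{1}{M}\right)B
```

Here \bar{Q} is the combined point estimate, W is the within-imputation variance, B is the between-imputation variance, and T is the total variance. The standard error of the combined estimate is the square root of T.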

An Example

Consider the following data set:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_mi/midata.dta

It contains fictitious data on the gender, race, education, experience, and wages of 1,000 individuals. The variable female is an indicator variable, race and education are categorical (education is ordered), and experience and wage are continuous.

The goal is to regress wage on the other variables:

reg wage female i.race i.edu exp

Note that the data set contains a significant number of missing values, making multiple imputation attractive. Here is complete code for creating the imputations and then analyzing them:

mi set wide
mi register imputed female-wage
mi impute chained (logit) female (mlogit) race (ologit) edu (regress) exp wage, add(10)
mi estimate: regress wage female i.race i.edu exp

Not Imputing the Dependent Variable (Using Regular Variables)

Some researchers prefer not to impute the dependent variable of their final model. However, it is generally agreed that the dependent variable must be included in the imputation models (along with all the other variables used in the analysis). The solution is to let mi impute chained impute the dependent variable, but then exclude those observations where the dependent variable was missing from the final analysis.

To do so, create an indicator variable touse which is 1 for observations with valid values of the dependent variable and 0 for observations with a missing value for the dependent variable. Then add if touse to your final model. Here's a modified version of the code above which does so:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_mi/midata.dta, clear
gen touse=(wage<.)
mi set wide
mi register imputed female-wage
mi register regular touse
mi impute chained (logit) female (mlogit) race (ologit) edu (regress) exp wage, add(10)
mi estimate: regress wage female i.race i.edu exp if touse

Note that touse is a regular variable: it is never missing and does not need to be imputed.

Passive Variables

Suppose you found that the log of wage is closer to normally distributed than wage itself, making it a better variable to use in both the imputation models and the final analysis. However, you still want to be able to determine the imputed value of wage. This makes wage a passive variable that depends on the imputed variable logwage.

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_mi/midata.dta, clear
gen logwage=log(wage)
mi set wide
mi register imputed female-exp logwage
mi register passive wage
mi impute chained (logit) female (mlogit) race (ologit) edu (regress) exp logwage, add(10)
mi passive: replace wage=exp(logwage)
mi estimate: regress logwage female i.race i.edu exp
mi estimate: mean wage logwage

By putting mi passive: in front of the replace command you're telling Stata to change the variable in all of the imputations. In the wide structure, the imputed values of wage are stored in _1_wage, _2_wage, etc. (along with the original values in plain old wage) but you don't need to know or care about that because mi passive: takes care of it for you.

This is the strength of mi: commands like mi passive, mi merge and mi reshape allow you to manage your data set almost as if it had not been multiply imputed.
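You can also create a brand-new passive variable after imputing. The sketch below assumes it is run right after the passive variable code above. The variable name expsq is hypothetical, chosen only for illustration.

```stata
* Create a new passive variable in every imputation at once
* (expsq is a hypothetical name chosen for this sketch)
mi passive: generate expsq = exp^2

* Check that expsq is now listed among the passive variables
mi describe

* Use it like any other variable in the analysis model
mi estimate: regress logwage female i.race i.edu exp expsq
```

Because expsq was created with mi passive, it is registered as passive automatically and no separate mi register command is needed.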

Examining Imputations Individually (mi xeq)

You can execute commands on each imputation individually using the mi xeq prefix. One common use is to get summary statistics for each imputation as a check to see whether the distribution of the imputations matches the distribution of the original data:

mi xeq: sum wage

The m=0 data is the original, unimputed data. The others are the individual imputations.

You can specify which imputations the command should act on by listing them after the mi xeq command (but before the colon). For example, you can get summary statistics for just the original data with:

mi xeq 0: sum wage

To examine all the imputed data but not the original data, you need to list all the imputations except 0. A numlist will be useful here:

mi xeq 1/10: sum wage

Stata recognizes this as a place for a numlist, so 1/10 is interpreted as "the integers from 1 to 10" not "1 divided by 10."
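The same approach works for other checks. As a sketch, assuming the ten imputations from An Example above exist, you can combine 0 with a range in a single numlist to compare the original data with just a few imputations:

```stata
* Compare the distribution of education in the original data (m=0)
* with its distribution in the first three imputations
mi xeq 0 1/3: tabulate edu

* More detailed summary statistics for wage in the original data
* and in the first imputation
mi xeq 0 1: summarize wage, detail
```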

Running Multiple Commands for each Imputation

If you need to run multiple commands for each imputation, keep them on one line but separated by semi-colons (note that this will not work if you've changed the delimiter to semi-colon). The following example estimates the mean for each imputation, then combines them using Rubin's Rules:

scalar m=0
mi xeq 1/10: mean wage; matrix b=e(b); scalar m=m+b[1,1];
display "Estimated mean="m/10

As a check, compare with:

mi estimate: mean wage

Of course there's no point in writing your own code to use Rubin's Rules for means, but if you need to estimate some quantity not included in the standard mi estimate output this technique could be vital. For more information, see Stata's FAQ "How can I combine results other than coefficients in e(b) with multiply imputed data?"
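The example above combines only the point estimates. A sketch of extending it to the standard error follows, assuming ten imputations and using the usual within-imputation and between-imputation variance formulas from Rubin's Rules. All the scalar and matrix names are chosen for this illustration. The between-imputation variance is computed from running sums, which is adequate for a sketch but not the most numerically careful approach.

```stata
* Running totals: sum of estimates, sum of squared estimates, sum of variances
scalar qsum = 0
scalar qsqsum = 0
scalar usum = 0
mi xeq 1/10: mean wage; matrix bm = e(b); matrix Vm = e(V); scalar qsum = qsum + bm[1,1]; scalar qsqsum = qsqsum + bm[1,1]^2; scalar usum = usum + Vm[1,1]

* Combined point estimate
scalar qbar = qsum/10
* Within-imputation variance: the average of the ten variances
scalar wvar = usum/10
* Between-imputation variance: the sample variance of the ten estimates
scalar bvar = (qsqsum - 10*qbar^2)/(10 - 1)
* Total variance
scalar tvar = wvar + (1 + 1/10)*bvar
display "Estimated mean = " qbar "   Std. err. = " sqrt(tvar)
```

The result should agree with the mean and standard error reported by mi estimate: mean wage.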

Prediction

You can make predictions based on models run with multiply-imputed data using mi predict. The predicted value is treated like any other parameter of interest: a prediction is made from each imputation and they are combined using Rubin's Rules. However, the resulting predictions should not be used as input for other calculations. Instead, those calculations must also be carried out for each imputation and the results combined using Rubin's Rules.

The big difference between mi predict and regular predict is that mi predict requires that the prior regression command save its results in a file mi predict can use. Thus:

mi estimate, saving(results, replace): regress wage female i.race i.edu exp
mi predict wagehat using results

The saving option (and note that it goes with mi estimate, not the regression command) saves a file called results.ster. The using part of mi predict tells Stata to retrieve and use those results.

mi predict calculates a single prediction for each observation: the predicted value does not vary between imputations.

Another important difference is that mi predict does not automatically calculate predicted probabilities after a logit model. For more information, including how to calculate the predicted probabilities, see the PDF documentation on mi predict.

Learning More

Hopefully this article has given you enough information to get started using multiple imputation in Stata. UCLA has a more detailed introduction as one of their online Statistical Computing Seminars. The Stata documentation on mi is both clear and thorough, and includes references to the literature on the topic. To read it, type help mi in Stata, then click on the link at the top that says [MI] intro. This will open the full PDF documentation on multiple imputation. Please keep in mind that before relying on results from using multiply imputed data you should study the theory of multiple imputation, not just how to do it in Stata.

Last Revised: 12/2/2011