*This is part two of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction.*

The decision to use multiple imputation rather than simply analyzing complete cases should not be made lightly. First, multiple imputation takes a substantial amount of time to learn, and a substantial amount of time to implement. Expect to spend at least as much time on your imputation model as on your analysis model—the model whose results you are interested in for your research. Second, the practical technique for doing it in the social sciences, multiple imputation by chained equations or MICE, lacks theoretical justification and using it may draw objections from some reviewers (of course not using it may draw objections from other reviewers). Third, it's quite possible to do it wrong and thus get invalid results without realizing it. This leads to a dilemma: if multiple imputation gives different results than complete case analysis, which will you believe? (And if it doesn't, what was the point?) Clearly you'll need to make sure you understand multiple imputation well enough to be confident you're using it properly.

On the other hand, complete cases analysis has substantial weaknesses as well. There is no single right answer to the question of how to handle missing data.

This series will focus almost exclusively on Multiple Imputation by Chained Equations, or MICE, as implemented by the mi impute chained command. We recognize that it does not have the theoretical justification Multivariate Normal (MVN) imputation has. However, most SSCC members work with data sets that include binary and categorical variables, which cannot be modeled with MVN. (There are ways to adapt it for such variables, but they have no more theoretical justification than MICE.) We will not discuss monotone or univariate imputation methods because we have yet to see an SSCC member with monotone data or just one variable to impute.

For many years Patrick Royston's ice command was the standard implementation of MICE in Stata. We express our appreciation for his contribution to the Stata community. However, the new mi impute chained command has all the resources of Stata Corporation behind it, works directly with the mi framework for handling imputed data, and we feel it is somewhat easier to learn and use. We'll thus use mi impute chained throughout this series and we suggest ice users switch to it.

Issues you should consider when deciding whether to use multiple imputation or not include the following:

The reason you probably considered multiple imputation in the first place was to avoid losing observations because they contain missing values. Multiple imputation allows you to use what information is available in those observations that contain missing values, which can lead to smaller confidence intervals and more ability to reject null hypotheses.

In the multiple imputation literature, data are "missing completely at random" (MCAR) if the probability of a particular value being missing is completely independent of both the observed data and the unobserved data. In other words, the complete cases are a random sample. If the data are MCAR, then both complete cases analysis and multiple imputation give unbiased estimates.

If the probability of a particular value being missing depends only on the observed data, then the data are "missing at random" (MAR) and the complete cases are not a random sample. With MAR data, complete cases analysis gives biased results but multiple imputation does not. If you believe your data are MAR rather than MCAR, then you should definitely consider using multiple imputation.

If the probability of a particular value being missing depends on the unobserved data, then the data are "missing not at random" (MNAR). In theory multiple imputation can give unbiased estimates with MNAR data, but only if the imputation method includes a model of the missingness mechanism. You'd need to code such a method yourself; it cannot be done using mi impute, ice, etc. In practice, if your data are MNAR it's going to be very hard to carry out legitimate analysis.

Note that MCAR and MAR do not require that the probability of one value being missing be independent of the probability of another value being missing. Missing values are often linked. For example, if a person was not contacted in a survey wave, that person will be missing all the variables from that wave, but the data could still be MCAR, MAR, or MNAR.

Example: MCAR vs. MAR vs. MNAR
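As a sketch of the three mechanisms, consider the following simulation (the variable names and parameters here are invented for illustration, not taken from any real data set). Income depends on education, and three copies of income are made missing under each mechanism:

```stata
* Simulate a complete data set: income depends on education
clear
set obs 1000
set seed 12345
gen educ   = rnormal(12, 3)
gen income = 2*educ + rnormal(0, 5)

* MCAR: each income value is missing with fixed probability 0.2,
* independent of all observed and unobserved values
gen income_mcar = income
replace income_mcar = . if runiform() < 0.2

* MAR: the probability of income being missing depends only on
* the observed value of educ
gen income_mar = income
replace income_mar = . if runiform() < invlogit(-2 + 0.15*educ)

* MNAR: the probability of income being missing depends on the
* (possibly unobserved) value of income itself
gen income_mnar = income
replace income_mnar = . if runiform() < invlogit(-4 + 0.2*income)
```

Comparing the mean of the observed values in each version against the mean of the complete income variable (for example with summarize) shows the point of the definitions: under MCAR the observed cases remain a random sample, while under MAR and MNAR the complete cases are systematically different from the full data.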

Testing whether a given data set is MCAR or MAR is straightforward. First, create a new indicator variable for each existing variable, equal to 1 if a given observation is missing that variable and 0 if it is not. The misstable command can do this part automatically with the gen() option. Then run logit models to test whether any of the other variables predict whether a given variable is missing. If they do, then the data are MAR rather than MCAR.

If you had variables y, x1 and x2, the code would look like:

misstable sum, gen(miss_)
logit miss_y x1 x2
logit miss_x1 y x2
logit miss_x2 y x1
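The gen(miss_) option of misstable is a convenience; for clarity, the indicators it creates are equivalent to building them by hand with Stata's missing() function (note that misstable only creates indicators for variables that actually contain missing values):

```stata
* Manual construction of the missingness indicators:
* missing() returns 1 if the value is missing, 0 otherwise
gen miss_y  = missing(y)
gen miss_x1 = missing(x1)
gen miss_x2 = missing(x2)
```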

It would also be a good idea to run t-tests to see if the values of the other variables vary between missingness groups:

ttest x1, by(miss_y)
ttest x2, by(miss_y)
ttest y, by(miss_x1)
ttest x2, by(miss_x1)
ttest y, by(miss_x2)
ttest x1, by(miss_x2)

The following code automates this entire process:

local numvars list of all numeric variables in data set
local missvars list of all variables with missing values in data set
misstable sum, gen(miss_)
foreach var of local missvars {
    local covars: list numvars - var
    display _newline(3) "logit missingness of `var' on `covars'"
    logit miss_`var' `covars'
    foreach nvar of local covars {
        display _newline(3) "ttest of `nvar' by missingness of `var'"
        ttest `nvar', by(miss_`var')
    }
}

If you have a lot of variables and can put them into a convenient *varlist* (like x1-x10 or even _all) replace the two initial local commands with unab:

unab numvars: numeric variables as varlist
unab missvars: variables with missing values as varlist
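For example, if the numeric variables happened to be y and x1 through x10 (hypothetical names), unab would expand the abbreviated varlist into a local macro containing the full list of variable names:

```stata
* unab expands a varlist abbreviation into a local macro
* holding the fully spelled-out variable names
unab numvars : y x1-x10
display "`numvars'"
```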

There is no formal test for determining whether a given set of logit results means the data are MCAR or MAR, but the results will give you a sense of how close the data are to MCAR and how big a problem the deviations from MCAR are likely to be. The bigger the deviation, the stronger the case for using multiple imputation rather than complete cases analysis.

By definition you cannot determine whether data are MNAR by looking at the observed values. Think carefully about how the data were collected and consider whether some values of the variables might make the data more or less likely to be observed. For example, people with very high or very low incomes might be less willing to disclose them, as might people with high BMIs. People with a strong interest in the topic of a survey might be more likely to respond than those who care less. Schools might try very hard to make sure students they expect to do well take standardized tests but put much less effort into having students they expect to do poorly take them. In the last example, adding variables like grades or socioeconomic status that predict test performance, and thus the probability of taking the test, might make the data plausibly MAR.

If you have low amounts of missing data (say, 1%) then multiple imputation and complete cases analysis are very likely to give essentially the same results, and complete cases analysis is much easier.

On the other hand, if you have very large amounts of missing data then your final results will be driven in large part by your imputation model rather than the observed data. There's no consensus on how much missing data is too much for multiple imputation, but certainly imputing 50% of your data is asking for trouble.

Multiple imputation is useful somewhere in between.

Next: Creating Imputation Models

Previous: Introduction

Last Revised: 1/13/2013