Multiple Imputation in Stata:
Managing Multiply Imputed Data

This is part five of the Multiple Imputation in Stata series. For a list of topics covered by this series, see the Introduction.

In many cases you can avoid managing multiply imputed data completely. Wherever possible, do any needed data cleaning, recoding, restructuring, variable creation, or other data management tasks before imputing. Because this is not always possible, the mi framework includes tools for managing multiply imputed data. However, in practice we rarely see them used. (This section may be expanded in the future as issues arise.)

mi Versions of Data Management Commands

Once you mi set your data and add imputations to it, the imputed values are added to the data set as either additional observations or additional variables, depending on which structure you chose. Commands which do not take that into account may or may not give correct results. The mi versions of basic data management commands do take the mi structure into account, ensuring that the changes you make are applied to all the imputations properly. These include mi merge, mi append, mi expand, mi rename, and many others (see the mi documentation). For all these commands the syntax for the mi version is identical to that of the regular version other than having some additional options available which are related to multiply imputed data.
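To see why the mi versions matter, here is a minimal sketch using mi rename. It assumes Stata's example data set mheart1s20 (already mi set, with bmi imputed) is available through webuse; the new name bodyMass is made up for illustration. In the wide style each imputation of bmi is stored in its own variable, and mi rename should rename all of them together:

```stata
* Assumes the Stata example data set mheart1s20 can be downloaded with webuse
webuse mheart1s20, clear
mi convert wide, clear

* In wide style the imputations are stored as _1_bmi, _2_bmi, and so on
describe *bmi

* mi rename changes the name in every imputation, not just the observed data
mi rename bmi bodyMass
describe *bodyMass
```

A plain rename would only have changed the variable holding the observed data, leaving the imputed copies under the old name.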

The command that's most commonly needed is mi reshape. Panel data (subjects observed over time) should be imputed in wide form where there is one observation per subject rather than one observation per subject per time period. That way the imputation model for a given variable in a given period can include values of the same variable in other periods, which are likely to be good predictors. However, analysis often requires the long form, where there is one observation per subject per period. Before imputing, switch to the wide form with:

reshape wide...

After imputing, switch back with:

mi reshape long...

The two commands will be identical except for adding mi and changing wide to long.
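As a rough sketch of the full round trip, the following simulates a small panel and carries it through both reshapes. Everything specific here is an assumption made for illustration: the variable names (id, year, income), the simulated values, the number of imputations, and the very simple imputation model.

```stata
* Simulated panel for illustration only: all names and values are hypothetical
clear
set seed 4321
set obs 200
gen id = _n
gen u = rnormal(0, 10)              // subject effect so periods are correlated
expand 3
bysort id: gen year = 2009 + _n     // years 2010, 2011, 2012
gen income = 50 + u + rnormal(0, 5)
replace income = . if runiform() < 0.2
drop u

* Before imputing: switch to one observation per subject
reshape wide income, i(id) j(year)

* Impute each period's income using the other periods as predictors
mi set wide
mi register imputed income2010 income2011 income2012
mi impute chained (regress) income2010 income2011 income2012, add(5) rseed(4321)

* After imputing: back to one observation per subject per period
mi reshape long income, i(id) j(year)
```

Note that the reshape wide and mi reshape long commands share the same stub, i(), and j().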

See the Hierarchical Data section of Stata for Researchers for more discussion of reshape and long vs. wide forms.

Setting mi Data

Commands like svyset, tsset, and xtset also have mi versions: mi svyset, mi tsset, mi xtset, etc. If you set your data before imputing (using the regular version of the command) it will still be set after imputing. If you need to set it after imputing, use the mi version.
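For example, declaring a panel structure after imputing might look like the sketch below. The data are simulated, and the variable names (id, year, x) and the intercept-only imputation model are placeholders chosen only to make the example run.

```stata
* Simulated two-period panel: all names and values are hypothetical
clear
set seed 101
set obs 50
gen id = _n
expand 2
bysort id: gen year = 2010 + _n
gen x = rnormal()
replace x = . if runiform() < 0.2

mi set mlong
mi register imputed x
mi impute regress x, add(3) rseed(101)

* The data were not xtset before imputing, so use the mi version now
mi xtset id year
```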

Keep in mind that mi impute chained cannot correct for survey structure. See the section on Survey Data in Creating Imputation Models for more discussion.

mi update

Certain things should always be true in multiply imputed data sets. For example, regular (unimputed) variables should have the same values in all imputations. The mi update command will check that this is so and fix any problems it finds (in this case, by setting the value in all imputations to the value in the observed data).

You may never need to run an mi update yourself, because each mi command that changes your data also runs mi update afterwards. While this is automatic, you should be aware of it for two reasons. First, if you're running a string of data management commands there's no need to do an mi update after each one. If your data set is big enough that the process is taking significant time, consider adding the noupdate option to all but the last command. Second, if you introduce super-varying variables or make other changes that mi update could find problematic, you need to be sure that mi update won't change your data inappropriately. For super-varying variables, that just means you shouldn't register them (see the section on super-varying variables). But if you're making complicated data changes you should read the mi update documentation carefully so that you know what it will do.

On the other hand, some data management commands do not have an mi version (drop is an example). You should run mi update yourself after using one of them.
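Both situations can be sketched together. This example assumes Stata's example data set mheart1s20 is available through webuse; the new variable names and the age cutoff are arbitrary choices for illustration.

```stata
* Assumes the Stata example data set mheart1s20 can be downloaded with webuse
webuse mheart1s20, clear

* A string of mi commands: skip the automatic mi update on all but the last
mi rename hsgrad highSchool, noupdate
mi rename female isFemale

* drop has no mi version, so run mi update yourself afterwards
drop if age < 30
mi update
```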

mi xeq

The mi xeq command allows you to act on your imputations one at a time. It was introduced in Imputing, where we needed it to check our imputation results. Be sure you're familiar with it before continuing.
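As a brief reminder of the syntax, the sketch below assumes Stata's example data set mheart1s20 is available through webuse. A list of numbers after mi xeq limits the command to those imputations, with 0 meaning the observed data:

```stata
* Assumes the Stata example data set mheart1s20 can be downloaded with webuse
webuse mheart1s20, clear

* Run summarize separately on the observed data and the first two imputations
mi xeq 0 1 2: summarize bmi

* With no list of numbers the command runs on the observed data and every imputation
mi xeq: count if bmi > 30
```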

Creating or Changing Variables

The process for creating or changing a variable depends on whether the variable is a regular variable, a passive variable, or a super-varying variable.

Regular Variables

New or changed variables that are functions of existing regular variables are also regular variables. They will have the same value in every imputation. You can create new regular variables or change the values of existing regular variables using mi xeq plus the standard gen, egen, and replace commands. You should register new regular variables as such, though it's not required.

mi xeq: gen lnIncome=ln(income)
mi register regular lnIncome

(Assuming income is regular, not imputed.)

Passive Variables

Passive variables are functions of imputed variables. Thus they will have different values in different imputations. They can be created or changed using mi passive followed by the standard gen, egen, or replace commands. With egen, however, mi passive should only be used for functions that depend solely on the current observation, like rowtotal(). Functions like total() or mean() depend on other observations and thus create super-varying variables (see the section on super-varying variables). Using mi passive automatically registers new variables as passive.

mi passive: gen lnIncome=ln(income)

(Now assuming income is imputed.)

Passive variables are not automatically changed if the variables they are based on change. If you need to update passive variables, the easiest way is probably to drop the existing versions and then rerun the commands that created them in the first place.
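A minimal sketch of that drop-and-recreate approach follows. It assumes Stata's example data set mheart1s20 is available through webuse, and lnBmi is a made-up variable used only to show the mechanics (it is not a recommendation to analyze a transformed bmi this way).

```stata
* Assumes the Stata example data set mheart1s20 can be downloaded with webuse
webuse mheart1s20, clear
mi passive: gen lnBmi = ln(bmi)

* ...suppose bmi is changed at this point, leaving lnBmi out of date...

* drop has no mi version, so follow it with mi update
drop lnBmi
mi update

* Rerun the command that created the passive variable in the first place
mi passive: gen lnBmi = ln(bmi)
```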

Passive variables are often problematic—the examples on transformations, non-linearity, and interactions show how using them inappropriately can lead to biased estimates.

Super-varying Variables

Normally, if a case is complete (has no missing values) it will be identical in all imputations. But consider a household income variable which is the total of all the individual incomes in the household: if one person's income is missing and must be imputed, then household income for everyone in that person's household will be different in each imputation, even people who are complete cases. Variables with the property that they vary between imputations even for complete cases are known as super-varying variables. Variables that are functions of the values of imputed variables for other observations are likely to be super-varying. Functions that depend on values in other observations include most egen functions, but also gen or replace expressions that use square brackets (x[1] or x[_n+1], for example).

If you need to create super-varying variables, switch to the flong format (or flongsep if you don't have enough memory for flong). In flong, there is one copy of each observation for each imputation, even for complete cases. Thus super-varying variables can have different values in each copy. The mlong and wide formats save memory by storing just one copy of complete cases, but this makes them unable to store super-varying variables.

Super-varying variables should be created or changed using mi xeq and the standard gen, egen, or replace commands (most likely egen).

mi convert flong, clear
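mi xeq: egen householdIncome=total(income), by(household)

(Assuming household is a variable that identifies each person's household. Without by(), total() would add up income across the entire data set.)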

Super-varying variables must not be registered. While they are theoretically passive variables, registering them as either passive or regular will prompt mi update to apply the (normally true) constraint that complete cases do not vary between imputations and "correct" their values. Leaving super-varying variables unregistered makes mi update leave them alone, but they'll still work in estimation commands.
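If you are not sure whether a variable is super-varying, the mi varying command reports variables that vary where they are not expected to. The sketch below simulates a small data set; the names (household, income, householdIncome), the household size of three, and the intercept-only imputation model are assumptions made for illustration.

```stata
* Simulated data for illustration only: all names and values are hypothetical
clear
set seed 2013
set obs 300
gen household = ceil(_n/3)          // three people per household
gen income = exp(rnormal(10, 1))
replace income = . if runiform() < 0.15

* flong keeps a full copy of every observation in every imputation
mi set flong
mi register imputed income
mi impute regress income, add(5) rseed(2013)

* Household totals differ across imputations even for complete cases
mi xeq: egen householdIncome = total(income), by(household)

* householdIncome is left unregistered; mi varying should report it as super varying
mi varying
```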

All the statistical concerns raised by passive variables also apply to super-varying variables.

mi extract

Sometimes you need to work with an individual imputation as a regular data set, ignoring the fact that it was imputed. Also, some special purpose software like HLM can work with multiply imputed data but expects that you will put each imputation in a separate file.

The tool for selecting a single imputation and turning it into a regular data set is mi extract, and the syntax is very simple:

mi extract n

where n is the number of the imputation you want to work with. After extraction, the data will not be mi set and there will be no indication it was ever imputed. n can be 0, in which case mi extract gives you the observed data, missing values and all.
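For instance, the following sketch pulls out the observed data and confirms that the missing values are still there. It assumes Stata's example data set mheart1s20 is available through webuse; the clear option tells mi extract it may replace the data currently in memory.

```stata
* Assumes the Stata example data set mheart1s20 can be downloaded with webuse
webuse mheart1s20, clear

* Imputation 0 is the observed data, missing values and all
mi extract 0, clear

* The result is an ordinary data set; list which variables have missing values
misstable summarize
```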

HLM reads SPSS files, not Stata files, but you can call on Stat/Transfer to convert your data sets to SPSS format. If you have 10 imputations, the following code will extract each imputation, save it as a separate data set, then have Stat/Transfer convert it to SPSS format:

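forval i=1/10 {
    * keep a copy of the full mi data so it can be restored for the next pass
    preserve
    * clear lets mi extract replace the data in memory even if it has unsaved changes
    mi extract `i', clear
    save hlm`i', replace
    * call Stat/Transfer to convert the Stata file to SPSS format
    ! "c:\program files (x86)\stattransfer11\st.exe" "hlm`i'.dta" "hlm`i'.sav"
    restore
}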

The command to call Stat/Transfer is written to work on Winstat. If Stat/Transfer is located in a different directory on your computer, you will need to modify that line accordingly. On Linstat it can be simply:

! st hlm`i'.dta hlm`i'.sav

mi import ice

If you have imputed data using ice, mi import ice will convert it from ice format to mi format, allowing you to use mi commands. The automatic option tells mi import ice to register all the variables and is highly recommended. It can tell which variables are regular by noting which ones never change between imputations, but it cannot distinguish between imputed and passive variables. If you have passive variables, use the passive() option and list them in the parentheses. If mi import ice finds that a variable is not regular and it is not listed in a passive() option, then it will mark the variable as imputed.

Example:

mi import ice, automatic passive(passiveVar1 passiveVar2)

Data imported from ice will be placed in the flong form, since that's essentially what ice uses. We suggest converting to wide or perhaps mlong (unless you have super-varying variables, which require flong):

mi convert wide, clear

Last Revised: 1/14/2013