Speeding up Multiple Imputation in Stata using Parallel Processing

Multiple imputation is computationally intensive and can be time-consuming if your data set is large. However, each imputation is created independently of the others, which means you can have multiple CPUs working on different imputations at the same time and then combine the results when they're all complete. This article will show you how to do so automatically using the SSCC's Condor flock, but the technique can be used on any computer with multiple CPUs (see What if I don't have access to a Condor flock?).

This article uses macros and loops. If you're not familiar with them, you should probably read Stata Programming Essentials before proceeding. You'll also need to run Stata programs on the SSCC's Linux servers. If you've never used Linux before, this is easier than you probably think--see Using Linstat for instructions. You may also want to read An Introduction to Condor, though this article will teach you all the Condor commands you'll need.

Stata users who want to do multiple imputation can choose between Stata's official mi commands and the user-written ice. At this point, mi (in particular mi impute chained) can do everything ice can do and we recommend everyone use mi. However, many people are still used to using ice. Fortunately you can use ice for the actual imputation and then convert the data set to mi's format so you can do things like combine imputations that were done separately. This article will discuss using both: look for mi and ice sections describing the commands that are needed only if you're using one or the other.

We'll introduce three do files. setup.do prepares the data and submits all the imputation jobs to Condor. impute.do is run by Condor--multiple times in parallel--and does the actual imputation. combine.do then combines the results into a single file. Full code for all three do files, with mi and ice versions, can be found at the end of this article. As written, the do files use an example data set we've made available at http://www.ssc.wisc.edu/sscc/pubs/files/missing_data.dta. It contains id, y and variables x1 through x10. y and all the x's have missing values. If you want to run the do files as written, make a directory for your work, make it your current directory and then place a copy of this data set in it. The easiest way is probably to run the following in Stata:

use http://www.ssc.wisc.edu/sscc/pubs/files/missing_data.dta
save missing_data

Alternatively you can adapt the do files to use your data immediately. For this technique to work your data set must have a unique identifier variable, so if your data set does not have one you'll have to create it (gen id=_n will probably do).
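
If you are adapting the do files to your own data, the identifier step might look like the following sketch. The file names my_data and my_data_id are placeholders, not files used elsewhere in this article; isid is included as an extra check that the identifier really is unique.

```stata
* my_data is a placeholder for your own data set
use my_data, clear

* create id only if the data set does not already have one
capture confirm variable id
if _rc {
    gen long id = _n
}

* halt with an error unless id uniquely identifies the observations
isid id

save my_data_id, replace
```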

Setting Up (setup.do)

Begin with the usual housekeeping (clearing the memory, starting a log, etc.).

mi

mi requires some configuration before you can do any imputing. Begin by loading the data. Then choose the data structure mi should use with the mi set command. If you need to manipulate your data after imputing you should learn what the different formats are and when each should be used--type help mi_styles. If not, it makes little difference which you use, so use mlong. Next, register the variables you wish to impute and save the result in a new file.

use missing_data
mi set mlong
mi register imputed x* y
save missing_data_mi, replace

Both

Now it's time to submit the imputation jobs to Condor. For this example we'll create ten imputations by submitting ten jobs that create one imputation each. In fairness to other users, do not submit more than fifteen jobs; if you want more than fifteen imputations, have each job create more than one imputation. The following code submits the jobs:

forvalues jobID=1/10 {
shell condor_stata impute `jobID' &
}

The shell command tells Stata to have Linux execute what follows. condor_stata impute tells Condor to run the Stata job impute.do (the .do is implied). `jobID' passes the current value of the jobID macro to impute.do as an argument--you'll see how impute.do uses it shortly. The & at the end tells Linux to run the condor_stata command in the background, so Stata doesn't have to wait for it to finish before proceeding with the loop.
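
If you want more imputations than jobs, one possible approach is to pass the number of imputations per job as a second argument. The sketch below shows fragments from two different do files, not one runnable program. The macro name nImp is hypothetical, and it assumes condor_stata passes a second argument through to the do file the same way it passes the first.

```stata
* --- fragment for setup.do ---
* nImp is a hypothetical macro: imputations per job (10 jobs x 3 = 30)
local nImp 3
forvalues jobID=1/10 {
    shell condor_stata impute `jobID' `nImp' &
}

* --- fragment for impute.do (mi version) ---
* read both arguments, then use the second one in add()
args jobID nImp
mi impute chained (regress) x* y, add(`nImp')
```

With ice, the same macro would go in the m() option instead of add(). combine.do should not need to change, since mi add brings in all the imputations stored in each file.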

Imputing (impute.do)

When setup.do is complete, ten instances of impute.do have been submitted to Condor and Condor will find CPUs for all of them. They differ in that each one has a different number for its jobID argument.

Since impute.do will always run in batch mode, it doesn't need to clear memory and such. Its first task is to retrieve its jobID. Do so with the args command:

args jobID

This stores the do file's argument in a macro called jobID. You can then use it to start a log file which will be unique to this instance of impute.do:

log using impute`jobID'.log, replace

The ten instances of impute.do will thus create ten log files, named impute1.log, impute2.log, etc. and you can check them individually as needed.

For reproducibility you want to set the seed for the random number generator, but each instance needs a different seed. One easy solution is to set the seed to an arbitrary number plus the instance's jobID:

set seed `=123454321+`jobID''
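
To see what seed each instance ends up with, you could run a loop like this one interactively. It is only an illustration of how the inline expression expands and is not part of any of the three do files.

```stata
* show the seed each of the ten jobs would set
forvalues jobID=1/10 {
    display "job `jobID' seed: `=123454321+`jobID''"
}
```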

mi

Load the data set you prepared for mi in setup.do:

use missing_data_mi

When mi is actually imputing it creates temporary files in the current directory, and if the ten instances of impute.do are trying to put their temporary files in the same directory they'll interfere with each other. Thus you need to create a directory for each instance (we'll remove them later) and make that the current directory:

mkdir impute`jobID'
cd impute`jobID'

Now you're ready to actually impute:

mi impute chained (regress) x* y, add(1)

When the imputation is complete, go back to the original directory and remove the one you created.

cd ..
rmdir impute`jobID'

Note that if your version of impute.do crashes for whatever reason before deleting the directories it creates, you'll need to delete them yourself before running it again.
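
If you would rather not clean up by hand after a crash, one possible variation is to prefix the directory commands with capture so that an existing directory does not stop the job. This is a sketch, not what the do files in this article do. It assumes that files left behind by an earlier crashed run do not interfere with a new imputation, which you should verify for your own setup. The lines would replace the corresponding ones in impute.do, where jobID has already been defined by args.

```stata
* ignore the error mkdir raises if the directory already exists
capture mkdir impute`jobID'
cd impute`jobID'

mi impute chained (regress) x* y, add(1)

cd ..
* ignore the error rmdir raises if leftover files keep the directory non-empty
capture rmdir impute`jobID'
```

With this variation a leftover directory no longer halts the job, but any directory that rmdir could not remove will still need to be deleted manually at some point.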

Finally save the file, including the jobID in the name to make it unique:

save impute`jobID',replace

ice

First load the data, then use ice to do the imputation. Use the clear option so the imputed data replaces the old data in memory rather than being saved as a file (since we're not done with it).

use missing_data
ice x* y, m(1) clear

In order to use mi's tools to combine imputations you need to convert the data to mi's format. This is done with mi import:

mi import ice, imputed(x* y) clear

The imputed option tells mi that the x variables and y were imputed, but in order to combine imputations mi also needs to be explicitly told that id was not imputed. Do that by registering it as a regular variable:

mi register regular id

mi import ice puts the data in flong format, since that's basically what ice uses. flong duplicates complete cases unnecessarily, so change the format to mlong (again, type help mi_styles if you want to learn more about the formats mi can use):

mi convert mlong, clear

Now you're ready to save the results. Include jobID in the file name to make it unique.

save impute`jobID',replace

Combining (combine.do)

When all ten instances of impute.do have completed, you'll have ten data sets in your current directory named impute1.dta through impute10.dta. Be sure they're all there before proceeding. You can use the Linux condor_q command on Kite to see if any instances of impute.do are still running. Condor will also email you when each job is completed. (Condor has a tool called DAGMan that can do things like automatically run combine.do when all the instances of impute.do are done, but it's somewhat cumbersome. If you're interested in using it contact the help desk for assistance.)
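
As an extra safeguard, combine.do could check that every file exists before it starts combining. The following sketch uses Stata's confirm file command, which stops the do file with an error if a file is missing. Note that it only checks existence: a file left over from an earlier run would pass the check even if the current job has not finished.

```stata
* halt with an error if any of the ten imputation files is missing
forvalues jobID=1/10 {
    confirm file impute`jobID'.dta
}
```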

Your next task is to combine the ten files with one imputation each into one file with ten imputations. Begin (after the usual housekeeping) by loading the first file:

use impute1

Next loop over the remaining files (i.e. 2 through 10), adding them to what's already in memory using mi add:

forvalues jobID=2/10 {
mi add id using impute`jobID', assert(match)
}

mi add is a bit like a merge, so you need to specify a key variable (in this case id) so it knows which observations to combine. The assert(match) option tells it that every observation should match and the do file should halt if any fail to match. Given how we created these files, a failure to match would mean something went very wrong.

All that's left is to save the output file containing all ten imputations:

save imputed_data, replace
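
Once the combined file is saved, a quick check along these lines can confirm that it holds all ten imputations and is ready for analysis. The regression is purely illustrative and not a model taken from this article; substitute whatever analysis you actually intend to run.

```stata
use imputed_data, clear

* summarize the mi settings, including the number of imputations
mi describe

* illustrative analysis only: fit a model across all imputations
mi estimate: regress y x*
```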

Code for the Do Files

Following is complete code for the do files described in this article. They assume that all the do files and the data are in your current directory, and that you're using an SSCC Linux server and thus can submit jobs to our Condor flock.

setup.do (mi version)

clear all
set more off
capture log close
log using setup.log, replace

use missing_data
mi set mlong
mi register imputed x* y
save missing_data_mi, replace

forvalues jobID=1/10 {
shell condor_stata impute `jobID' &
}

log close

setup.do (ice version)

clear all
set more off
capture log close
log using setup.log, replace

forvalues jobID=1/10 {
shell condor_stata -b do impute `jobID' &
}

log close

impute.do (mi version)

args jobID
log using impute`jobID'.log, replace
set seed `=123454321+`jobID''
use missing_data_mi
mkdir impute`jobID'
cd impute`jobID'
mi impute chained (regress) x* y, add(1)
cd ..
rmdir impute`jobID'
save impute`jobID',replace
log close

impute.do (ice version)

args jobID
log using impute`jobID'.log, replace
set seed `=123454321+`jobID''
use missing_data
ice x* y, m(1) clear
mi import ice, imputed(x* y) clear
mi register regular id
mi convert mlong, clear
save impute`jobID',replace
log close

combine.do (both)

clear all
set more off
capture log close
log using combine.log, replace

use impute1
forvalues jobID=2/10 {
mi add id using impute`jobID', assert(match)
}
save imputed_data, replace

log close

What if I don't have access to a Condor flock?

This article was primarily written to help SSCC members take advantage of the power of the SSCC's Condor flock, but the techniques described can also be used to take advantage of all the CPUs in today's multi-CPU computers. Just replace the command that submits the impute.do jobs to Condor with a command that runs them on your computer.

For example, to run these jobs on a Windows PC you might replace the line in setup.do that submits each job to Condor, such as:

shell condor_stata -b do impute `jobID' &

with:

winexec "C:\Program Files (x86)\Stata12\StataMP-64.exe" -b do impute `jobID'

The winexec command is very similar to shell, but tells Stata not to wait for the job to finish (since Windows doesn't use & for that). You may need to experiment to find the exact command that will work on your computer.

Some other considerations:

  • Do not submit more jobs than you have CPUs, or they'll just compete for computing time. If you have a two-CPU ("Dual Core") computer but want ten imputations, submit two jobs that create five imputations each.
  • Make sure you have enough memory for all the jobs you submit. If Stata needs to use disk space as virtual memory it will slow down tremendously.
  • If your profile.do tells Stata to start in a particular directory, remove that command temporarily so the impute.do jobs start in the same directory as setup.do.
  • If you're not the only one using the computer be sure to leave enough CPUs for others. (SSCC members should never use this technique on Winstat--that's what Condor is for.)
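
On a Linux computer without Condor, the same idea might be sketched as follows for a two-CPU machine. The executable name stata is an assumption: depending on your installation it may be stata-mp or stata-se, and it may need a full path.

```stata
* setup.do fragment: start two batch-mode Stata jobs in the background
forvalues jobID=1/2 {
    shell stata -b do impute `jobID' &
}

* impute.do would then request five imputations per job, for example:
*   mi impute chained (regress) x* y, add(5)
* and combine.do would loop over 2/2 instead of 2/10
```

As with the Windows example, you may need to experiment to find the exact command that works on your computer.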

Last Revised: 2/8/2012