Bootstrapping in Stata

Stata's bootstrap command makes it easy to bootstrap just about any statistic you can calculate. The results of almost all Stata commands can be bootstrapped immediately, and it's relatively straightforward to put any other results you've calculated in a form that can be bootstrapped. This article will show you how.

If you're just looking to bootstrap the results of a Stata command, all you'll need is a basic familiarity with Stata. However, if you need to calculate something else and then bootstrap it, you'll need to write a Stata program to do so. If you're not familiar with writing Stata programs (which are not the same as do files), you'll want to take a look at Programming in Stata, in particular the section on programs.

Bootstrapping Results from Stata Commands

If there is a single Stata command that calculates the result you need, you can simply tell Stata to bootstrap the result of that command. As an example, load the automobile data that comes with Stata and consider trying to find the mean of the mpg variable. The summarize (sum) command will do exactly what you want:

sysuse auto
sum mpg

But how will the bootstrap command find the number it needs in all that output? The answer is that you will tell it where to look in the return vector.

The Return Vector

In addition to the output you see on the screen or in your log, most Stata commands quietly put their results in a return vector (Stata's own documentation calls these stored results). You can refer to this vector in subsequent commands, or in the case of bootstrap you can tell it what part of the return vector you care about.

To see the current contents of the return vector, type

return list

The sum command is a basic command (as opposed to an estimation command) so its return vector is called r(). Looking over the list, you'll see that r(mean) is the number you want. You're now ready to actually carry out the bootstrap.
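As a quick check, the value can be read back from r() right after sum runs. The lines below are a minimal sketch; the scalar name m is just an illustrative choice and is not used elsewhere in this article. Keep in mind that r() is overwritten by the next command that stores results there, so copy anything you need before running another command.

* run the command, then read its stored result
sum mpg
display r(mean)
* copy the result into a scalar (hypothetical name m) so it survives later commands
scalar m = r(mean)
display "Mean mpg is " m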

The bootstrap Command Syntax

The basic syntax for a bootstrap command is simple:

bootstrap var=r(result): command

Here var is simply what you want to call the quantity you're bootstrapping. You're welcome to choose any name you like as long as it meets the usual rules for a Stata variable name. In our case meanMPG would be appropriate.

r(result) tells the bootstrap command to look in the r() vector for the particular result you're interested in. We're interested in r(mean).

Finally command should be replaced by the actual command that calculates the result you want. In our case it's sum mpg.

Putting this all together, the command to bootstrap the mean of the variable mpg is simply:

bootstrap meanMPG=r(mean): sum mpg

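When you run that, you'll get a warning explaining that, because sum is not an estimation command, bootstrap cannot tell which observations were actually used and so resamples from all of them, including any with missing values (more on estimation commands shortly). That won't be a problem in this case because mpg has no missing values. The results you want will follow.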

What if you wanted to bootstrap two different quantities? No problem, just list them both:

bootstrap meanMPG=r(mean) maxMPG=r(max): sum mpg

Bootstrapping Estimation Commands

Estimation commands are slightly different in that they store their results in the e() vector rather than the r() vector and must be listed by typing ereturn list rather than return list. To see this, type the following:

reg mpg weight foreign
ereturn list

One warning: bootstrap is an estimation command, so after running it the e() vector will contain the results of the bootstrap, not the results of the command you were bootstrapping.

Suppose you wanted to bootstrap the F-statistic for some odd reason. All you'd have to do is type:

bootstrap f=e(F): reg mpg weight foreign
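The earlier warning about e() being overwritten can be checked directly. This is a sketch using the same regression; reps(100) is an arbitrary choice, and e(N_reps) and e(se) are results that bootstrap itself stores.

* e() holds the regression results at this point
reg mpg weight foreign
display e(F)
* after bootstrap, e() holds the bootstrap results instead
bootstrap f=e(F), reps(100): reg mpg weight foreign
ereturn list
display e(N_reps)
matrix list e(b)
matrix list e(se)

After the bootstrap run, e(b) holds the bootstrapped statistic f rather than the regression coefficients. If you need the original regression results afterwards, run the regression again.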

A more common example would be to bootstrap the coefficients. They're available in e(b) but that's a matrix so getting at them individually would be complicated. Fortunately this is so common that it's set up as a convenient special case: if bootstrap is given nothing to bootstrap, it will look for an e(b) matrix and bootstrap that. Thus all you need to type is:

bootstrap: reg mpg weight foreign
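If only one or two coefficients matter, they can also be named directly with Stata's _b[] notation instead of relying on the e(b) default. The following is a sketch; the names bWeight and ratio are illustrative, and the ratio of the two coefficients is used only to show that any expression built from them can be bootstrapped.

* bootstrap a single coefficient
bootstrap bWeight=(_b[weight]): reg mpg weight foreign
* bootstrap a function of two coefficients
bootstrap ratio=(_b[weight]/_b[foreign]): reg mpg weight foreign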

Bootstrap Options

The bootstrap command has a fair number of options available. The nowarn option suppresses the warning about e(sample) that appeared after our first example. The reps option allows you to choose how many bootstrap replications are performed; the default is 50. For a full list of options type help bootstrap.

However, all these options apply to the bootstrap command and not to the command you're bootstrapping. Thus they go after a comma as always, but before the colon that ends the bootstrap part of the command. You could then have another comma at the end of the command to be bootstrapped, followed by options that apply to it. For example:

bootstrap perc90=r(p90), nowarn reps(25): sum mpg, detail

This bootstraps the 90th percentile of mpg, which is only available if sum is given the detail option. It also suppresses the warning message and only does 25 replications. Note where all those options are located in the command.
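Because bootstrap draws random resamples, the results change from run to run unless the random-number seed is fixed. The sketch below adds the seed() option and a larger number of replications; the seed value 12345 and reps(1000) are arbitrary illustrative choices, not recommendations from this article.

* same bootstrap as above, but reproducible and with more replications
bootstrap perc90=r(p90), nowarn reps(1000) seed(12345): sum mpg, detail
* report percentile confidence intervals from those replications
estat bootstrap, percentile

Running the first line twice with the same seed should give identical results, which makes it easier to check your work or share it with others.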

Bootstrapping Results You've Calculated

If all you need to do is bootstrap the results of existing Stata commands you may want to stop here, especially since things are about to get a bit more complicated.

If there's no single Stata command that will calculate a result you want to bootstrap, you'll just have to write your own. As you hopefully know from reading Programming in Stata, Stata allows you to write programs that act like regular Stata commands. You can even make them return results so that they'll work with bootstrap.

Suppose you wanted to bootstrap the statistic "Mean weight of those cars in the top quartile for mpg." Calculating the statistic isn't hard to do:

xtile quartile=mpg, nq(4)
sum weight if quartile==4

But since it requires two commands it can't be bootstrapped as is. We'll need to write a program that carries out those two steps and returns the result in r().

program define topQuartileMean, rclass
xtile quartile=mpg, nq(4)
sum weight if quartile==4
return scalar tqm=r(mean)
drop quartile
end

Most of this should be familiar, but there are a few additional elements that need to be explained.

Adding the rclass option to the program definition tells Stata that this program will be putting things in the r() vector. The return command is what actually does so, and scalar means this particular result is a single number as opposed to a matrix like e(b). We're calling our returned value tqm (as in top quartile mean) so it will be available after the program runs as r(tqm). The number we're putting in it is the r(mean) result from the previous sum command--not a result of our topQuartileMean program, which doesn't have results yet.

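Also note that we need to drop the quartile variable at the end so we can create a new one in the next bootstrap replication. For the same reason, if you created quartile by hand while trying out the two commands above, drop it before running the program, or xtile will fail because the variable already exists.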
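Before handing a program to bootstrap, it can help to run it once on its own and confirm that it returns what you expect. A minimal check, assuming the auto data is loaded and topQuartileMean has been defined as above:

* clear any leftover quartile variable (does nothing if there is none)
capture drop quartile
* run the program once and inspect what it returned
topQuartileMean
return list
display r(tqm)

If r(tqm) appears in the list, the program is returning its result in the form bootstrap needs.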

Now that the program topQuartileMean is defined, you can use it with bootstrap just like any other Stata command:

bootstrap tqm=r(tqm): topQuartileMean

You'll then get your results.
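One possible refinement is to use a temporary variable, which Stata drops automatically when the program ends, even if the program stops with an error partway through. The following is a sketch of that alternative and not part of the original example; the program name topQuartileMean2 and the reps() and seed() values are illustrative.

* remove any earlier definition so the program can be redefined
capture program drop topQuartileMean2
program define topQuartileMean2, rclass
    * tempvar reserves a temporary variable name, stored in the local macro quartile
    tempvar quartile
    xtile `quartile' = mpg, nq(4)
    sum weight if `quartile' == 4
    return scalar tqm = r(mean)
end
bootstrap tqm=r(tqm), reps(200) seed(2008): topQuartileMean2

With this version there is no explicit drop command, and a quartile variable already in the data set does not get in the way.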

Last Revised: 2/7/2008