Making Predictions with Counter-Factual Data in Stata

Note: this article has been superseded by the very useful margins command. It is kept here because margins cannot be used in some contexts, such as multiple imputation.

Social science researchers often want to ask hypothetical questions: How would the income distribution in my sample change if all the black people were white? How would the household structure in my sample be different if the demographics hadn't changed since 1970? You can try to answer such questions by first estimating a model, then seeing what that model predicts when you give it counter-factual data.

If you're using a linear model, it's just a matter of multiplying the change in an independent variable by its coefficient. Non-linear models are more complicated, because the effect of a change depends on the values of all the other variables. Fortunately Stata makes this kind of work very easy, and this article will show you how.
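As a rough sketch of the linear case, the commands below use the example automobile data (introduced in the next section) with an arbitrary linear model chosen purely for illustration. The names yhat, yhat_cf, change, meanChange, sdChange and expected are made up for this sketch.

```stata
* Illustration only: an arbitrary linear model on the example auto data.
sysuse auto, clear
regress price weight foreign

* Fitted values with the real data.
predict double yhat

* Counter-factual: every car is 100 pounds heavier.
replace weight = weight + 100
predict double yhat_cf

* The change is the same for every car, so its standard deviation is
* zero up to rounding error.
gen double change = yhat_cf - yhat
summarize change
scalar meanChange = r(mean)
scalar sdChange = r(sd)

* It equals the coefficient on weight times the change in weight.
scalar expected = 100*_b[weight]
display meanChange
display expected

* Empty memory so the walkthrough below starts from scratch.
clear
```

In a logit the same comparison would give a different change for every car, which is why the rest of this article leans on predict rather than on the coefficients.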

Example: Car Quality

In the 1970s a perception arose that cars produced in the United States were less reliable than cars produced in other countries (especially Japan). Investigations into the cause of this difference helped launch the "quality" movement, and "quality" became a major buzzword in the business community through the '80s and '90s. We'll use the 1978 automobile data set that comes with Stata to examine this difference and try to answer the question "If all the cars in our sample had been built outside the United States, how would that change their repair records?"

Start up Stata, then load the 1978 automobile data by typing

sysuse auto

Note that sysuse loads data from wherever Stata is installed. That means it's only useful for loading the example data sets that come with Stata, but it does allow us to ignore the fact that different versions of Stata store them in different locations.

The rep78 variable is the measure we'll use for car quality. It is a five-point scale, with 5 being the best. To see it, type

tab rep78, missing

Several cars have missing values for rep78 which makes them useless for our analysis. So we'll drop them:

drop if rep78==.

Of course if this were actual research we'd have to think about whether this would bias our sample.
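One possible first look at that question, offered only as a sketch, is to compare the cars that lack a repair record with the rest. Since we've already dropped them, the commands below reload the full data temporarily inside preserve and restore. The variable name noRecord and the choice of price for the comparison are assumptions made for this example.

```stata
* Set the working data (already trimmed) aside.
preserve

* Reload the full example data, including cars with no repair record.
sysuse auto, clear
gen byte noRecord = missing(rep78)

* Are domestic and foreign cars equally likely to lack a record?
tab foreign noRecord, row

* Do cars without a record differ in price?
ttest price, by(noRecord)

* Bring back the working data; noRecord disappears with the full copy.
restore
```

After restore, the data in memory are exactly as they were before, so the rest of the walkthrough is unaffected.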

We also have a variable called foreign. To see its values type

tab foreign

tab foreign, nolabel

Let's begin by seeing if there's any evidence for the hypothesis that foreign cars are more reliable:

cor rep78 foreign

by foreign: sum rep78

tab rep78 foreign, chi2

Clearly there's some basis for the perception.


To examine this further, let's begin with a simple logistic regression. Since logistic can handle just two outcomes, we'll condense the five-point rep78 scale into the indicator variable highQuality:

gen highQuality=(rep78>3)

This creates a variable which takes on the value one for cars with rep78>3 and zero for others. Note that cars with missing values for rep78 would be counted as high quality with this code, which is one reason we dropped them right away.
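If you'd rather not depend on having dropped the missing values first, one possible variation adds an if condition so that cars without a repair record stay missing instead of being counted as high quality. The name highQuality2 is made up for this sketch.

```stata
* highQuality2 is missing whenever rep78 is missing.
gen byte highQuality2 = (rep78>3) if !missing(rep78)

* With the missing values already dropped, the two versions agree.
assert highQuality2 == highQuality
```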

Now let's run a logistic regression with highQuality as the dependent variable. Clearly we want foreign as one of our independent variables. What else to include is a difficult question, especially for non-engineers. One might expect more expensive cars to be more reliable, so we'll include price. Since most car problems involve the engine, characteristics of the engine seem relevant. Thus we'll include displacement and gear_ratio, along with weight as a measure of the load on the engine. The command is:

logit highQuality foreign price displacement gear_ratio weight

Add the or option if you prefer odds ratios to coefficients.
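For instance, typing logit with no variable list redisplays the model just estimated, so a minimal way to see the odds ratios without re-entering the whole command is:

```stata
* Redisplay the last logit model, reporting odds ratios.
logit, or

* An odds ratio is just the exponentiated coefficient.
display exp(_b[foreign])
```

Either way the model itself is unchanged. Only the way the results are displayed differs.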

The results are mostly negative (so much for expensive cars being more reliable) but the coefficient on foreign is significant.

How much difference would it make if the cars were all foreign? To begin, we'll calculate and store the predicted probability of each car being high quality under our model:

predict p

Type help predict for full details on the predict command, but its basic function is to make predictions using whatever regression model you ran last.
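As a small illustration of what predict did by default, the sketch below asks for the linear prediction with the xb option and converts it to a probability by hand. The names xbhat and pCheck are made up for this example.

```stata
* The linear prediction (the log odds) rather than the probability.
predict double xbhat, xb

* The logistic function turns the log odds into a probability.
gen double pCheck = invlogit(xbhat)

* pCheck matches p up to the precision of p, which is stored as a float.
assert abs(pCheck - p) < 1e-6
```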

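The mean of p will be the same as the proportion of cars which are high quality, as it always is when a logit model includes a constant and the predictions are made on the data used to estimate it. To see that, type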

sum p highQuality

But when predict runs it uses whatever data are in memory at the time. It doesn't know or care if the data have changed since the regression was run. This allows us to set up a counter-factual scenario and then use predict to see what our model says about it.

We're going to change the value of foreign, but since we'll want to change it back we'll store the real value in a separate variable first:

gen realForeign=foreign

Then we'll pretend that all the cars are foreign by setting foreign to one:

replace foreign=1

Now we simply run predict again to generate the predicted probability for this counter-factual scenario:

predict cfp

Now compare the counter-factual prediction with reality by typing:

sum cfp highQuality

As you see, this model suggests that if all these cars were produced overseas, the proportion which are high quality would increase from 42% to 87%.
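Two optional follow-ups, sketched here under our own naming (cfChange is made up for this example): looking at the change for each car rather than just the mean, and checking the counter-factual mean against the margins command mentioned at the top of the article. We put the real values of foreign back before calling margins so that the data in memory match the data used for estimation.

```stata
* Change in predicted probability for each car.
* It is zero for cars that really are foreign.
gen double cfChange = cfp - p
sum cfChange

* Put the real values back before using margins.
replace foreign = realForeign

* Average predicted probability with foreign set to 1 for every car.
* This should agree with the mean of cfp.
margins, at(foreign=1)
```

The appeal of doing it by hand is that cfp and cfChange remain in the data set as ordinary variables you can graph, tabulate, or summarize by group.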

Multinomial Logit

By condensing the five-point scale of rep78 into the indicator variable highQuality we obviously threw away a lot of information. We can use rep78 directly if we use a multinomial logit to examine the probabilities of all five outcomes.

We'll begin by estimating the model. Since we always want to estimate the model using the real data, we need to set foreign back to its actual values:

replace foreign=realForeign

Next we'll run the exact same regression as before, except that we'll use rep78 as our dependent variable instead of highQuality and use multinomial logit:

mlogit rep78 foreign price displacement gear_ratio weight

We now have four sets of coefficients, one for each outcome other than the base outcome of rep78=3. Each coefficient describes how a variable changes the probability of getting that outcome relative to the base outcome. We'll also have five predicted probabilities, one for each outcome:

predict m1-m5
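A quick sanity check, offered as a sketch: since the five outcomes cover every possibility, the five predicted probabilities for each car should add up to one. The name mTotal is made up for this example.

```stata
* Add up the five predicted probabilities for each car.
egen double mTotal = rowtotal(m1 m2 m3 m4 m5)

* Allow for rounding, since predict stores floats by default.
assert abs(mTotal - 1) < 1e-5
```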

Now we're ready to do the counter-factual:

replace foreign=1

predict cfm1-cfm5

sum m1-m5 cfm1-cfm5

We see that the predicted proportions with rep78=4 and rep78=5 are much higher with foreign set to one. Note how the predicted proportion for rep78=3 is much lower. Just looking at the regression coefficients, you might think that increasing foreign makes rep78=1 and rep78=2 less likely compared to rep78=3 and thus the proportion for rep78=3 should go up. But in fact the rep78=4 outcome is even more strongly affected, so rep78=3 ends up going down.

Of course in this case the coefficients on foreign for rep78=1 and rep78=2 are nonsense because no foreign cars in our sample have rep78<3. But it's a general principle that just looking at the coefficients in a multinomial logit can be deceptive.
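You can confirm the empty cells yourself. Because foreign is currently set to one for every car, this sketch uses the saved copy in realForeign:

```stata
* Repair record by the real country of origin.
tab rep78 realForeign

* Number of foreign cars with a repair record below 3.
count if realForeign==1 & rep78<3
```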

Ordered Logit

By using multinomial logit we threw away the information that the values of rep78 have an implied order. Ordered logit takes advantage of that information. The ologit command does ordered logistic regression, and its syntax is identical to that of mlogit:

replace foreign=realForeign

ologit rep78 foreign price displacement gear_ratio weight

predict o1-o5

replace foreign=1

predict cfo1-cfo5

sum o1-o5 cfo1-cfo5

Interestingly, ordered logit gives a much higher proportion in rep78=5 with the counter-factual scenario than multinomial logit. Which of these results to believe (if any) is left as an exercise for the reader.
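To put the two sets of counter-factual predictions side by side, one option is to compute the gap between them for each outcome. This is only a sketch, and the names cfGap1 through cfGap5 are made up for it. It also puts the real values of foreign back, which is a good habit once you're done with a counter-factual.

```stata
* Ordered logit prediction minus multinomial logit prediction, by outcome.
forvalues k = 1/5 {
    gen double cfGap`k' = cfo`k' - cfm`k'
}
sum cfGap1-cfGap5

* Put the real values back when finished.
replace foreign = realForeign
```

Since each model's probabilities sum to one for every car, the five gaps for a car sum to zero: whatever one model adds to rep78=5 it must take from the other outcomes.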

Beyond Proportions

Naturally you're not limited to looking at the means of predicted probabilities and interpreting them as predicted proportions. Each observation has a full set of predicted probabilities which may be interesting in and of themselves.

Note, however, that when we set foreign to 1 for all observations we reduced the variation in our sample. These data have lots of quasi-continuous variables, so there's not much danger of creating duplicates. But if you have lots of categorical variables, many reasonable counter-factual scenarios will make all the observations which fall into a certain category or set of categories identical. That's not necessarily a problem, but it is an issue to be aware of.

For example, you may want to set up a counter-factual scenario where each individual is assigned the mean income for their gender, race, and state of residence in a different period. However, having done so you could no longer talk about the variation in income within a given gender/race/state combination. If you did something similar with all the individual variables in your data set, you'd then have no variation whatsoever within each gender/race/state combination, and you should expect that any predictions you make will also not vary within those groups.
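The automobile data have no income, gender, race, or state, so as a stand-in this sketch assigns every car the mean price of its group, using foreign as the only grouping variable. Both substitutions, and the name groupPrice, are assumptions made for illustration.

```stata
* Work on a temporary copy of the example data.
preserve
sysuse auto, clear

* Mean price within each group (domestic and foreign).
bysort foreign: egen double groupPrice = mean(price)

* Counter-factual: every car gets its group's mean price.
replace price = groupPrice

* The standard deviation of price within each group is now zero.
bysort foreign: sum price

* Return to the working data.
restore
```

Any prediction that depended only on price and foreign would now take just two distinct values, one per group.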

Last Revised: 10/17/2006