Stata for Researchers: Statistics

This article will teach you how to get descriptive statistics, do basic hypothesis testing, run regressions, and carry out some postestimation tasks. This is a very small sample of Stata's capabilities, but it will give you a sense of how Stata's statistical commands work.

General Information

A good place to start with any new data set is describe. This gives you information about the data set, including the amount of memory it needs and a list of all its variables and their types and labels. Especially watch out for value labels. If you have a large data set and only need information about a few variables, you can give describe a varlist:

describe foreign
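The variable names used throughout this article (make, rep78, foreign, mpg) match the 1978 automobile data set that ships with Stata, so the sketch below assumes that is the data you are working with. If it is not in memory yet, something like this should get you started:

```stata
* Assumption: the examples use Stata's bundled automobile data.
* Note that clear discards any unsaved data currently in memory.
sysuse auto, clear

* Describe the whole data set, then just two variables
describe
describe rep78 foreign
```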

For more information about your variables try the Variables Manager (third button from the right or type varman).

Summary Statistics

summarize (sum) gives you summary statistics. If you just type:

sum

you will get basic summary statistics for all the variables in your data set. Note that there is nothing for make: it is a string variable so summary statistics don't make sense. Also note that for rep78 the number of observations is 69 rather than 74. That's because the five missing values were ignored and the summary statistics calculated over the remaining 69. Most statistical commands take a similar approach to missing values and that's usually what you want, so you rarely have to include special handling for missing values in statistical commands.
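If you'd like to check the missing value accounting yourself, a minimal sketch (assuming the auto data set is in memory) is to count the missing values directly and then look at the number of observations summarize reports in r(N):

```stata
* Assumes the auto data set is in memory
count if missing(rep78)      // observations where rep78 is missing
summarize rep78
display r(N)                 // observations summarize actually used
```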

All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a varlist:

sum mpg

If you want summary statistics for just the foreign cars, add an if condition:

sum mpg if foreign

If you want summary statistics of mpg for both foreign and domestic cars but calculated separately, use by:

by foreign: sum mpg
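One thing to keep in mind is that by requires the data to be sorted by the grouping variable, and will stop with a "not sorted" error otherwise. If you are not sure how your data are sorted, bysort is a safe alternative:

```stata
* Assumes the auto data set is in memory.
* bysort sorts by foreign first, then runs the command once per group.
bysort foreign: summarize mpg
```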

The detail (d) option will give more information. Try:

sum mpg, d

Frequencies

tabulate (tab) will create tables of frequencies. If you give it a varlist with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table. To get an idea of what tab does, try:

tab rep78
tab rep78 foreign

Tables are usually easier to read if the variable with the most unique values comes first, so they're listed vertically.

Note that the missing values of rep78 were ignored. If you'd like them to have their own entry, add the missing option:

tab rep78, missing

The tab command won't accept more than two variables, but you can create three-way or higher tables by combining tab with by:.

by foreign: tab headroom rep78

To get percentages, add the row, column or cell options:

tab rep78 foreign, row column cell

For this table, row answers the question "What percentage of the cars with a rep78 of one are domestic?" while column answers "What percentage of the domestic cars have a rep78 of one?" and cell answers "What percentage of all the cars are both domestic and have a rep78 of one?"

If you add an if condition, you'll get the frequencies of just those observations which meet it:

tab rep78 foreign if mpg>25

tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:

tab foreign, sum(mpg)

There's also a chi2 option that runs a chi-squared test on a two-way table:

tab rep78 foreign, chi2
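Like most Stata commands, tab leaves its results behind so you can use them in later commands. The sketch below (assuming the auto data set is in memory) shows the test statistic and p-value again and then lists everything tab stored:

```stata
* Assumes the auto data set is in memory
tabulate rep78 foreign, chi2
display "chi-squared = " r(chi2) "   p-value = " r(p)
return list                  // everything tabulate left behind in r()
```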

Correlations

correlate (cor) calculates correlations:

cor weight length mpg

If you need covariances instead, add the cov option:

cor weight length mpg, cov

Hypothesis Tests of Means

ttest tests hypotheses about means. To test whether the mean of a variable is equal to a given number, type ttest var==number:

ttest mpg==20

To test whether two variables have the same mean, type ttest var1==var2:

ttest mpg==weight

To test whether two subsamples of your data have the same mean for a given variable, use the by() option:

ttest mpg, by(foreign)
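By default the two-sample test assumes both groups have the same variance. If you would rather not assume that, a possible variation is the unequal option. The sketch below also shows the stored p-values, which are handy when you want a one-sided test:

```stata
* Assumes the auto data set is in memory
ttest mpg, by(foreign) unequal   // no equal-variance assumption (Satterthwaite degrees of freedom)
display r(p)                     // two-sided p-value
display r(p_l)                   // one-sided p-value for Ha: diff < 0
```

Here diff is the mean of the first group (foreign==0) minus the mean of the second (foreign==1).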

Exercises

  1. Find the mean value of weight for cars with mpg greater than 25.
  2. You already ran a chi-squared test which rejected the hypothesis that foreign and rep78 are unrelated, but perhaps that's due to the few domestic cars with rep78<3 (while no foreign cars have rep78<3). Repeat the test, excluding those cars.
  3. Test the hypothesis that cars with mpg>25 have a lower mean weight than cars with mpg<=25. You'll have to create a new variable to do so.

Regression

Stata has many, many commands for doing various kinds of regressions, but its developers worked hard to make them all as similar as possible. Thus if you can do a simple linear regression you can do all sorts of more complex models.

Linear Regression

The regress (reg) command does linear regression. It always needs a varlist, and it uses it in a particular way: the first variable is the dependent variable, and it is regressed on all the others in the list plus a constant (unless you add the noconstant option).

Let's estimate how much consumers were willing to pay for good gas mileage in 1978 using a naive hedonic pricing model. Whether a car is foreign or domestic seems to be important, so throw that in too. Type:

regress price mpg foreign

This regresses price on mpg and foreign. The results suggest that American consumers disliked fuel efficiency, and would pay to avoid it!

Like any good researcher, when our empirical results contradict our theory we first look for better empirical results. We might possibly have some omitted variable bias here; in particular it's probably important to control for the size of the car. Looking over the variables we see lots of variables related to size. You could include them all, but they're probably highly correlated and you don't want to introduce collinearity. Check using the correlate (cor) command.

cor weight length displacement trunk headroom

While all the variables are positively correlated, weight, trunk, and headroom aren't too bad so go ahead and add all three:

reg price mpg foreign weight trunk headroom

Now mpg is insignificant but weight is highly significant. Looks like Americans liked big cars and didn't care about fuel efficiency. That I'll believe.
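Pairwise correlations are only a rough guide to collinearity. One common follow-up, offered here as a suggestion rather than as part of the original analysis, is to look at variance inflation factors right after the regression:

```stata
* Assumes the auto data set is in memory
regress price mpg foreign weight trunk headroom
estat vif                    // variance inflation factor for each regressor
```

Large values suggest a regressor is close to a linear combination of the others. Where to draw the line is a judgment call.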

Logistic Regression

Logistic regression is just as easy, but we need a binary dependent variable. Make an indicator variable goodRep which is one for cars with rep78 greater than three (and missing if rep78 is missing):

gen goodRep=(rep78>3) if rep78<.

Now let's examine what predicts a car's repair record. We'll include mpg, displacement and gear_ratio because they're the only technical data we have about the car's engine (the most likely thing to break), weight as a measure of load on the engine, and price and foreign just because they seem to be important characteristics of a car.

There are two different commands for running logistic models. They do the same thing, but report their results differently.

logit reports coefficients:

logit goodRep mpg displacement gear_ratio weight price foreign

while logistic reports odds ratios:

logistic goodRep mpg displacement gear_ratio weight price foreign
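The two sets of results are tied together by the fact that an odds ratio is just the exponentiated coefficient. As a quick sketch (assuming goodRep was created as shown above), you can check this for mpg and also ask logit itself to report odds ratios:

```stata
* Assumes the auto data set is in memory and goodRep exists
logit goodRep mpg displacement gear_ratio weight price foreign
display exp(_b[mpg])         // odds ratio for mpg, computed from the logit coefficient

* Replaying logit with the or option reports odds ratios, as logistic does
logit, or
```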

Multinomial Logit

In collapsing the five-point rep78 scale to the indicator goodRep we threw away a lot of information. But we can work with all the values of rep78 in a multinomial logistic model, run using mlogit:

mlogit rep78 mpg displacement gear_ratio weight price foreign

This tells us how each variable affects the probability of a car being in a given rep78 category as opposed to the base category. (Stata chose three as the base category because it is the most common.) However, we shouldn't take the model too seriously for rep78==1 or rep78==2 because foreign is one of the most important variables and no foreign cars have a rep78 of one or two. To exclude them, add an if condition:

mlogit rep78 mpg displacement gear_ratio weight price foreign if rep78>=3

Note that different variables are significant for rep78==4 than for rep78==5. Our logit model couldn't see those differences.
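If you would rather choose the base category yourself, mlogit has a baseoutcome() option. The sketch below picks five as the base purely for illustration. It also excludes missing values explicitly: Stata treats missing as larger than any number, so rep78>=3 is true when rep78 is missing, although mlogit drops observations with a missing dependent variable in any case.

```stata
* Assumes the auto data set is in memory.
* The choice of 5 as the base category is arbitrary, for illustration only.
mlogit rep78 mpg displacement gear_ratio weight price foreign ///
    if rep78>=3 & rep78<., baseoutcome(5)
```

Changing the base category changes which comparisons the coefficients describe, not the fit of the model.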

Multinomial logit is a complex model: difficult to compute and difficult to interpret. But Stata makes running it very easy.

Postestimation

Once you've run a model, Stata stores vital information about the model for use in later commands (in the e() results--see Programming in Stata if you're interested). This allows you to run a variety of postestimation commands. Again, we can only give you a sampling.

Before beginning, let's run a simpler model:

reg mpg c.weight##c.weight i.foreign displacement gear_ratio

This regresses mpg on weight, weight squared, foreign, displacement and gear_ratio. While we don't need to specify that foreign is a factor variable for the regression to work, we will need it later. Note that the coefficient on foreign is negative: domestic cars do have a lower mean mpg, but that is due to their greater weight.
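Postestimation commands refer to coefficients by name, and with factor variables and interactions those names are not always the plain variable names. A small sketch of how to look them up, assuming the regression above is the most recent estimation command:

```stata
* Assumes the mpg regression above was the last model run
regress, coeflegend           // replay the results, showing each coefficient's _b[] name
display _b[1.foreign]         // coefficient on the foreign indicator
display _b[c.weight#c.weight] // coefficient on weight squared
ereturn list                  // everything the model stored in e()
```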

Hypothesis Testing

The test command tests hypotheses about the model. The syntax is just test plus a list of hypotheses, which are tested jointly. In setting up hypotheses, the name of a variable is taken to mean the coefficient on that variable. Because foreign entered the model as the factor variable i.foreign, its coefficient is listed in the regression output as 1.foreign, so use that name:

test 1.foreign==-3

tests whether the coefficient on foreign is -3. If you list a variable all by itself it is assumed that you want to test whether its coefficient is zero:

test displacement gear_ratio

tests the hypothesis that the coefficients on displacement and gear_ratio are jointly zero.

You can have variables on both sides of the equals sign:

test weight==displacement

which is equivalent to:

test weight-displacement==0
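Two related variations may be useful; treat them as a sketch rather than a complete tour. lincom reports the estimated value of a linear combination along with its standard error and confidence interval, and putting each hypothesis in its own parentheses lets test handle several of them jointly:

```stata
* Assumes the mpg regression above was the last model run
lincom weight - displacement                 // estimate and confidence interval for the difference
test (weight==displacement) (gear_ratio==0)  // both hypotheses tested jointly
```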

Predicted Values

The predict command puts the model's predicted values in a variable:

predict mpghat

("hat" refers to the circumflex commonly used to denote estimated values). You can calculate the residuals with:

gen res1=mpg-mpghat

or you can let predict do it for you:

predict res2, residuals

The result will be the same except for round-off error (since predict uses double precision internally it will have less round-off error, but unless your data have seven digits of precision it doesn't matter).
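To see how small the difference really is, you could compare the two sets of residuals directly. This sketch assumes res1 and res2 were created as shown above; the names res3 and resdiff are just illustrative:

```stata
* Assumes res1 and res2 exist and the mpg regression was the last model run
predict double res3, residuals       // store the residuals as doubles instead of floats
gen double resdiff = abs(res1 - res2)
summarize resdiff                    // differences should be tiny, about float precision
```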

You can change your data between running the model and making the predictions, which means you can look at counterfactual scenarios like "What if all the domestic cars were foreign?" See Making Predictions with Counter-Factual Data in Stata for some examples.
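A minimal sketch of that idea, assuming the mpg regression above is still the most recent model, uses preserve and restore so the real data come back afterwards (the variable name mpg_allforeign is just illustrative):

```stata
* Assumes the auto data set is in memory and the mpg regression was the last model run
preserve                     // keep a copy of the real data
replace foreign = 1          // counterfactual: every car is foreign
predict mpg_allforeign       // predictions are computed from the altered data
summarize mpg_allforeign     // mean predicted mpg under the counterfactual
restore                      // bring back the original data
```

Because restore brings back the data as they were at preserve, mpg_allforeign disappears along with the altered values of foreign, so look at the results before restoring.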

Margins

On the other hand, all the examples in the above article could be done more easily with the margins command, which was introduced in Stata 11. It's designed to let you examine the influence of variables in your model. Try:

margins foreign

This sets foreign to zero for all cars, leaving the other variables unchanged, finds the predicted mpg for each car, and then averages them. It then sets foreign to one for all cars and repeats the process. Thus if all cars were domestic, our model predicts that the mean mpg would be 22.18, while if all cars were foreign the mean mpg would be 19.22.

An alternative approach is to set all variables to their means other than foreign. You can do this with the atmeans option:

margins foreign, atmeans

The foreign variable can only take on two values (Stata knows this because we marked it as i.foreign in the original regression) so the margins command calculated its results for both of them. Obviously we can't look at all possible values for continuous variables, so for continuous variables we have to specify the values we're interested in with the at() option:

margins, at(displacement=200)

This tells us if all cars had a displacement of 200, the mean predicted mpg would be 21.28. If you want to look at multiple values, list them in parentheses. You can also specify a range inside the parentheses: 200/300 means every integer from 200 to 300, while 200(50)300 steps from 200 to 300 in increments of 50.

margins, at(displacement=(200 300))

If you want to look at the marginal effect of a variable, or the derivative of the outcome with respect to that variable, use the dydx() option.

margins, dydx(displacement)

In this simple case, the derivative is just the coefficient on displacement. But recall that we included both weight and weight squared. Thus:

margins, dydx(weight)

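is much more useful. Keep in mind that the result is a function of weight. The figure reported above is the average of the derivative over the observed values of weight (which, because the derivative of a quadratic is linear in weight, is the same as the derivative at the mean weight), but you might also want to try different values of weight with at().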
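For example, a sketch along these lines evaluates the marginal effect of weight at a few specific weights. The values 2000, 3000 and 4000 are arbitrary illustrations chosen to fall within the range of weight in the data:

```stata
* Assumes the mpg regression above was the last model run.
* The three weights are illustrative values, not special in any way.
margins, dydx(weight) at(weight=(2000 3000 4000))
```

Each row of the output is the derivative of predicted mpg with respect to weight, evaluated with weight set to the given value for every car.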

Once you get into nonlinear models like logit or mlogit the value of margins with dydx() is even greater.

Exercises

  1. Go back to the regression of price on mpg, foreign, weight, trunk and headroom. Add weight squared to the regression and interpret the results. Does increasing weight ever reduce the price in this data set?
  2. Since weight squared made such a difference, try adding mpg squared. Test whether the coefficients on mpg and mpg squared are jointly zero. (Hint: refer to coefficients by the names given in the regression output, even if they aren't "real" variable names.)

Last Revised: 10/6/2009