This is part four of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.
This article will teach you how to get descriptive statistics, do basic hypothesis testing, run regressions, and carry out some postestimation tasks. This is a very small sample of Stata's capabilities, but it will give you a sense of how Stata's statistical commands work.
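The examples in this article use the same auto data set as the rest of the series. If it's not already loaded, type:
sysuse auto, clear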
A good place to start with any new data set is describe. This gives you information about the data set, including the amount of memory it needs and a list of all its variables with their types and labels. Especially watch out for value labels. If you have a large data set and only need information about a few of its variables, you can give describe a varlist:
describe foreign
For more information about your variables, try the Properties window or the Variables Manager (the third button from the right on the toolbar, or type varmanage).
summarize (sum) gives you summary statistics. If you just type:
sum
you will get basic summary statistics for all the variables in your data set. Note that there is nothing for make: it is a string variable, so summary statistics don't make sense. Also note that for rep78 the number of observations is 69 rather than 74. That's because the five missing values were ignored and the summary statistics were calculated over the remaining 69. Most statistical commands take a similar approach to missing values, and that's usually what you want, so you rarely have to include special handling for missing values in statistical commands.
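If you want to verify the missing values yourself, count them (a quick check, not something the later examples depend on):
count if missing(rep78)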
All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a varlist:
sum mpg
If you want summary statistics for just the foreign cars, add an if condition:
sum mpg if foreign
If you want summary statistics of mpg for both foreign and domestic cars but calculated separately, use by:
by foreign: sum mpg
The details (d) option will give more information. Try:
sum mpg, d
tabulate (tab) will create tables of frequencies. If you give it a varlist with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table. To get an idea of what tab does, try:
tab rep78
tab rep78 foreign
Tables are usually easier to read if the variable with the most unique values comes first, so that its values are listed vertically.
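For example, compare the previous table with one that puts foreign first:
tab foreign rep78
rep78, with five values, is easier to scan when it runs down the rows.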
Note that the missing values of rep78 were ignored. If you'd like them to have their own entry, add the missing option:
tab rep78, missing
The tab command won't accept more than two variables, but you can create three-way or higher tables by combining tab with by:.
by foreign: tab headroom rep78
To get percentages, add the row, column or cell options:
tab rep78 foreign, row column cell
For this table, row answers the question "What percentage of the cars with a rep78 of one are domestic?" while column answers "What percentage of the domestic cars have a rep78 of one?" and cell answers "What percentage of all the cars are both domestic and have a rep78 of one?"
If you add an if condition, you'll get the frequencies of just those observations which meet it:
tab rep78 foreign if mpg>25
tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:
tab foreign, sum(mpg)
There's also a chi2 option that runs a chi-squared test on a two-way table:
tab rep78 foreign, chi2
correlate (cor) calculates correlations:
cor weight length mpg
If you need covariances instead, add the cov option:
cor weight length mpg, cov
ttest tests hypotheses about means. To test whether the mean of a variable is equal to a given number, type ttest var==number:
ttest mpg==20
To test whether two variables have the same mean, type ttest var1==var2:
ttest mpg==weight
To test whether two subsamples of your data have the same mean for a given variable, use the by() option:
ttest mpg, by(foreign)
Stata has many, many commands for doing various kinds of regressions, but its developers worked hard to make them all as similar as possible. Thus if you can do a simple linear regression you can do all sorts of more complex models.
The regress (reg) command does linear regression. It always needs a varlist, and it uses it in a particular way: the first variable is the dependent variable, and it is regressed on all the others in the list plus a constant (unless you add the noconstant option).
Let's estimate how much consumers were willing to pay for good gas mileage in 1978 using a naive hedonic pricing model. Whether a car is foreign or domestic seems to be important, so throw that in too. Type:
regress price mpg foreign
This regresses price on mpg and foreign. The results suggest that American consumers disliked fuel efficiency, and would pay to avoid it!
Like any good researcher, when our empirical results contradict our theory, we first look for better empirical results. We may have omitted variable bias here; in particular, it's probably important to control for the size of the car. Looking over the data set we see several variables related to size. You could include them all, but they're probably highly correlated and you don't want to introduce collinearity. Check using the correlate (cor) command:
cor weight length displacement trunk headroom
While all the variables are positively correlated, the correlations among weight, trunk, and headroom aren't too bad, so go ahead and add all three:
reg price mpg foreign weight trunk headroom
Now mpg is insignificant but weight is highly significant. Looks like Americans liked big cars and didn't care about fuel efficiency. That I'll believe.
Logistic regression is just as easy, but we need a binary dependent variable. Make an indicator variable goodRep which is one for cars with rep78 greater than three (and missing if rep78 is missing):
gen goodRep=(rep78>3) if rep78<.
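A quick cross-tabulation confirms that goodRep is coded as intended:
tab rep78 goodRep, missing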
Now let's examine what predicts a car's repair record. We'll include mpg, displacement and gear_ratio because they're the only technical data we have about the car's engine (the most likely thing to break), weight as a measure of load on the engine, and price and foreign just because they seem to be important characteristics of a car.
There are two different commands for running logistic models. They do the same thing, but report their results differently.
logit reports coefficients:
logit goodRep mpg displacement gear_ratio weight price foreign
while logistic reports odds ratios:
logistic goodRep mpg displacement gear_ratio weight price foreign
In collapsing the five-point rep78 scale to the indicator goodRep we threw away a lot of information. But we can work with all the values of rep78 in a multinomial logistic model, run using mlogit:
mlogit rep78 mpg displacement gear_ratio weight price foreign
This tells us how each variable affects the probability of a car being in a given rep78 category as opposed to the base category. (Stata chose three as the base category because it is the most common.) However, we shouldn't take the model too seriously for rep78==1 or rep78==2 because foreign is one of the most important variables and no foreign cars have a rep78 of one or two. To exclude them, add an if condition:
mlogit rep78 mpg displacement gear_ratio weight price foreign if rep78>=3
Note that different variables are significant for rep78==4 than for rep78==5. Our logit model couldn't see those differences.
Multinomial logit is a complex model: difficult to compute and difficult to interpret. But Stata makes running it very easy.
Consider the variable rep78: it is a measure of the car's repair record and takes on the values one through five (plus a few missing values). However, these numbers only represent categories: a car with a rep78 of five is not five times better than a car with a rep78 of one. Thus it would make no sense to include rep78 in a regression as-is. Instead, you might want to include a set of indicator variables, one for each value of rep78. This is even more important for categorical variables with no underlying order, like race. Stata can create such indicator variables for you "on the fly"; in fact you can treat them as if they were always there.
The set of indicator variables representing a categorical variable is formed by putting i. in front of the variable's name. This works in most (but not all) varlists. To see how it works, try:
list rep78 i.rep78
As you see, 3.rep78 is one if rep78 is three and zero otherwise. The other indicators are constructed in the same way. 1b.rep78 is a special case: it is the base category, and always set to zero to avoid the "dummy variable trap" in regressions. If rep78 is missing, all the indicator variables are also missing.
If you want to choose a different category as the base, add b followed by the number of the desired base category to the i:
list rep78 ib3.rep78
Now try using i.rep78 in a regression:
reg price weight foreign i.rep78
The coefficients for each value of rep78 are interpreted as the expected change in price if a car moved to that value of rep78 from the base value of one. If you change the base category:
reg price weight foreign ib3.rep78
the model is the same, but the coefficients are now the expected change in price if a car moves to that value of rep78 from a rep78 of three. You can verify that the models are equivalent by noting that the coefficients in the second model are just the coefficients of the first model minus the coefficient on 3.rep78 from the first model.
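If you'd rather check directly, the lincom postestimation command (not otherwise used in this article) computes linear combinations of coefficients. After running the first model:
reg price weight foreign i.rep78
lincom 5.rep78 - 3.rep78
The result matches the coefficient on 5.rep78 in the second model.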
You don't have to use the full set of indicators. For example, you could pick out just the indicator for a rep78 of five with:
reg price weight 5.rep78
This has the effect of collapsing all the other categories into a single category of "not five."
Indicator variables are, in a sense, categorical variables. Marking them as such will not affect your regression output; you'll get the same results from:
reg price weight ib3.rep78 foreign
as from:
reg price weight ib3.rep78 i.foreign
However, the latter tells Stata that foreign is not continuous, which is very important to some postestimation commands. For example, if you put foreign in your model then the margins command (which we'll discuss shortly) will cheerfully calculate what the effect on price would be if all the cars became "slightly more foreign." If you put i.foreign in your model, margins will know cars are either foreign or not foreign and act accordingly. However, if you're not planning to run margins or some other postestimation command that cares about this distinction, putting foreign in your model rather than i.foreign is just fine.
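To see the difference in action (this borrows the margins command discussed below):
reg price weight ib3.rep78 i.foreign
margins, dydx(foreign)
Because foreign entered the model as i.foreign, margins reports the discrete change in expected price from domestic to foreign rather than treating foreign as continuous.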
You can add interactions between variables by putting two pound signs between them:
reg price weight foreign##rep78
The variables in an interaction are assumed to be categorical unless you say otherwise. The main effects of both variables are included automatically. Thus the above model includes everything in:
reg price weight i.foreign i.rep78
What it adds is a new set of indicator variables, one for each unique combination of foreign and rep78. This allows the model to see, for example, whether the effect of having a rep78 of five is different for foreign cars than for domestic cars.
Note that while Stata chose rep78==1 for its base category, it had to drop the rep78==5 category for foreign cars because no foreign cars have a rep78 of one. If you'd prefer that it drop the same category for both types of cars, choose a different base category:
reg price weight foreign##ib3.rep78
You can specify that a model should include only interaction effects and not main effects by putting one pound sign between the variables (foreign#rep78) but this is almost always a mistake.
To form interactions involving a continuous variable, use the same syntax but put c. in front of the continuous variable's name:
reg price foreign##c.weight i.rep78
This allows the effect of weight on price to be different for foreign cars than for domestic cars (i.e. they can have different slopes).
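If you want to see the two slopes explicitly, one option (previewing the margins command covered below) is:
margins foreign, dydx(weight)
This reports the derivative of price with respect to weight separately for domestic and foreign cars.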
The ## symbol is an operator just like + or -, so you can use parentheses with the usual rules:
reg price foreign##(c.weight rep78)
This interacts foreign with both weight and rep78. The latter is automatically treated as a categorical variable since it appears in an interaction and does not have c. in front of it.
Interactions are formed by multiplication: to form an indicator for "car is foreign and has a rep78 of 5" multiply an indicator for "car is foreign" by an indicator for "car has a rep78 of 5." But this is not limited to indicators:
reg price c.weight##c.weight
This regresses price on weight and weight squared, allowing you to consider non-linear effects of weight (at least second order Taylor series approximations to them). You could estimate the same model with:
gen weightSquared=weight^2
reg price weight weightSquared
Specifying the model using interactions is shorter, obviously. But it also (again) helps postestimation commands understand the structure of the model. If you asked margins to find the effect on price of changing weight after running the second model, it would not take into account the fact that changing weight also changes weightSquared. If you specify the squared term using interactions, postestimation commands will understand the relationship between them.
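You can see the difference for yourself with a quick experiment (again previewing margins):
reg price weight weightSquared
margins, dydx(weight)
reg price c.weight##c.weight
margins, dydx(weight)
The first margins result is just the coefficient on weight; the second averages the full derivative (the coefficient on weight plus twice the coefficient on the squared term times each car's weight) and thus reflects both terms.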
Most statistical commands also save their results so that you can use them in subsequent commands. You can see what is saved with the return list command. To see a typical example, try:
sum mpg
return list
These saved results are often referred to as the r vector.
Suppose you want to standardize mpg, meaning you want to subtract its mean and divide by its standard deviation. Both of those quantities are available in the returned results, and the command is:
gen mpgStandardized=(mpg-r(mean))/r(sd)
A standardized variable has a mean of zero and a standard deviation of one, so you can check the results with:
sum mpgStandardized
The mean isn't quite zero due to round-off error, but it's as close as a computer can get.
If you type:
return list
again, you'll see that the contents of the r vector have changed. It contains only the results of the most recent command, so if you need any of those results, be sure to use them (or store them in variables) before running another command that replaces the contents of the r vector.
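For example, if you'll need the mean of mpg again later, store it right away (mpgMean is a name invented for this illustration):
sum mpg
gen mpgMean = r(mean)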
Estimation commands store values in the e vector, which can be viewed with the ereturn list command. Try:
mlogit rep78 weight foreign price
ereturn list
The e(sample) function tells you whether a particular observation was in the sample used for the previous regression. It is 1 (true) for observations that were included and 0 (false) for observations that were not. In this case, the five observations with missing values of rep78 were excluded. e(sample) can be very useful if you think missing data may be causing problems with your model. For example, you could type:
tab foreign if e(sample)
to check which values of foreign actually appear in the data used in the regression. Or:
sum mpg
sum mpg if e(sample)
will tell you if the mean value of mpg is different for the observations used than for the entire data set, which could indicate that the data are not missing at random.
Regression coefficients are stored in the e(b) matrix. It's possible to extract and use them but it requires working with matrices, which will not be covered in this series.
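That said, a single coefficient can be retrieved without matrices using the _b[] notation; for example, after a single-equation model:
reg price weight foreign
display _b[weight]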
Most of the time you won't use the e vector directly. Instead you'll use Stata's postestimation commands. We'll cover just a small sample of them.
Run the following model:
reg mpg c.weight##c.weight i.foreign displacement gear_ratio
This regresses mpg on weight, weight squared, foreign, displacement and gear_ratio. While we don't need to specify that foreign is a factor variable for the regression to work, we will need it later. Note that the coefficient on foreign is negative: domestic cars do have a lower mean mpg, but that is due to their greater weight.
The test command tests hypotheses about the model. The syntax is just test plus a list of hypotheses, which are tested jointly. In setting up hypotheses, the name of a variable is taken to mean the coefficient on that variable.
test gear_ratio==1.5
tests whether the coefficient on gear_ratio is 1.5. If you list a variable all by itself it is assumed that you want to test whether its coefficient is zero:
test displacement gear_ratio
tests the hypothesis that the coefficients on displacement and gear_ratio are jointly zero. If you want to jointly test more complicated hypotheses, put each hypothesis in parentheses:
test (gear_ratio=1.5) (weight=-.02)
You can have variables on both sides of the equals sign:
test weight==displacement
which is equivalent to:
test weight-displacement==0
The predict command puts the model's predicted values in a variable:
predict mpghat
("hat" refers to the circumflex commonly used to denote estimated values). You can calculate the residuals with:
gen res1=mpg-mpghat
or you can let predict do it for you:
predict res2, residuals
The result will be the same except for round-off error (since predict uses double precision internally it will have less round-off error, but unless your data have seven digits of precision it doesn't matter).
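You can check that the two sets of residuals agree:
sum res1 res2
The summary statistics should be essentially identical.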
You can change your data between running the model and making the predictions, which means you can look at counterfactual scenarios like "What if all the domestic cars were foreign?" See Making Predictions with Counter-Factual Data in Stata for some examples.
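A minimal sketch of the idea, using preserve and restore so the changes are temporary (mpgForeign is a made-up name):
preserve
replace foreign = 1
predict mpgForeign
sum mpghat mpgForeign
restore
This compares the model's predictions for the actual data with its predictions if every car were foreign.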
The margins command is a useful tool for exploring what your regression results mean. For example, if you want to look at the marginal effect of a variable, or the derivative of the outcome with respect to that variable, you can use margins with the dydx() option:
margins, dydx(displacement)
In this simple case, the derivative is just the coefficient on displacement. But consider changing weight: since the model includes both weight and weight squared you have to take into account how both change. The result will depend on what weight was to begin with. But margins will do it all for you:
margins, dydx(weight)
What margins does here is take the numerical derivative of the mean expected mpg with respect to weight. In doing so, margins looks at the actual data. Thus it considers the effect of changing the Honda Civic's weight from 1,760 pounds as well as changing the Lincoln Continental's from 4,840 (the weight squared term matters more for the latter than the former). It then averages these effects across all the cars to get its result of -.006064: each additional pound of weight reduces the mean expected mpg by .006064 miles per gallon.
If you're interested in the change at a particular point, add the at() option. For example, to see how changing weight would change the expected mpg if all the cars weighed 4,000 pounds, type:
margins, dydx(weight) at(weight=4000)
A common point of interest is the mean. You can get the results there with atmeans:
margins, dydx(weight) atmeans
Note that this sets all the variables to their means, not just weight.
Once you get into nonlinear models like logit or mlogit the value of margins is even greater. For more information see Exploring Regression Results using Margins.
Next: Working with Groups
Previous: Working With Data
Last Revised: 1/18/2011
