This article will teach you how to get descriptive statistics, do basic hypothesis testing, run regressions, and carry out some postestimation tasks. This is a very small sample of Stata's capabilities, but it will give you a sense of how Stata's statistical commands work.
A good place to start with any new data set is describe. This gives you information about the data set, including the amount of memory it needs and a list of all its variables and their types and labels. Especially watch out for value labels. If you have a large data set and only need information about a few of its variables, you can give describe a varlist:
describe foreign
For more information about your variables try the Variables Manager (third button from the right or type varman).
summarize (sum) gives you summary statistics. If you just type:
sum
you will get basic summary statistics for all the variables in your data set. Note that there is nothing for make: it is a string variable so summary statistics don't make sense. Also note that for rep78 the number of observations is 69 rather than 74. That's because the five missing values were ignored and the summary statistics calculated over the remaining 69. Most statistical commands take a similar approach to missing values and that's usually what you want, so you rarely have to include special handling for missing values in statistical commands.
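If you want to confirm how many observations a command actually used, a minimal sketch like the following may help. It assumes the auto data set that ships with Stata (the one used throughout this article) and reloads it so the example runs on its own; note that reloading discards any unsaved changes.

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
* Count the cars with no repair record
count if missing(rep78)
* summarize silently skips those cars
summarize rep78
* r(N) holds the number of observations summarize used
display r(N)
```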
All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a varlist:
sum mpg
If you want summary statistics for just the foreign cars, add an if condition:
sum mpg if foreign
If you want summary statistics of mpg for both foreign and domestic cars but calculated separately, use by:
by foreign: sum mpg
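The by: prefix requires the data to be sorted by the grouping variable. The auto data happen to be sorted by foreign already, but with other data you may need to sort first. A hedged sketch of two alternatives, assuming the same auto data: bysort sorts and runs the command in one step, and tabstat puts the group statistics in a single table.

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
* bysort sorts by foreign first, then runs summarize within each group
bysort foreign: summarize mpg
* tabstat reports the same kind of group statistics in one compact table
tabstat mpg, by(foreign) statistics(mean sd n)
```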
The detail (d) option will give more information. Try:
sum mpg, d
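The numbers summarize prints are also left behind in r(), where later commands can use them. A small sketch, again assuming the auto data:

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
summarize mpg, detail
* return list shows everything summarize stored in r()
return list
* Individual results can be used in expressions
display "median mpg = " r(p50) ", mean mpg = " r(mean)
```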
tabulate (tab) will create tables of frequencies. If you give it a varlist with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table. To get an idea of what tab does, try:
tab rep78
tab rep78 foreign
Tables are usually easier to read if the variable with the most unique values comes first, so they're listed vertically.
Note that the missing values of rep78 were ignored. If you'd like them to have their own entry, add the missing option:
tab rep78, missing
The tab command won't accept more than two variables, but you can create three-way or higher tables by combining tab with by:
by foreign: tab headroom rep78
To get percentages, add the row, column or cell options:
tab rep78 foreign, row column cell
For this table, row answers the question "What percentage of the cars with a rep78 of one are domestic?" while column answers "What percentage of the domestic cars have a rep78 of one?" and cell answers "What percentage of all the cars are both domestic and have a rep78 of one?"
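Requesting all three percentages at once makes for a crowded table. If you only care about one of them, something like the following sketch may be easier to read. It assumes the auto data and uses the nofreq option to hide the raw counts.

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
* Row percentages only; nofreq suppresses the frequencies
tab rep78 foreign, row nofreq
```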
If you add an if condition, you'll get the frequencies of just those observations which meet it:
tab rep78 foreign if mpg>25
tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:
tab foreign, sum(mpg)
There's also a chi2 option that runs a chi-squared test on a two-way table:
tab rep78 foreign, chi2
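Several cells of this table contain very few cars, and the chi-squared test is usually considered unreliable when expected counts are small. As a hedged sketch (assuming the auto data), the expected option shows the expected count in each cell and the exact option asks for Fisher's exact test instead:

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
* expected adds the expected count under independence to each cell
tab rep78 foreign, chi2 expected
* exact requests Fisher's exact test
tab rep78 foreign, exact
* The exact p-value is stored in r(p_exact)
display r(p_exact)
```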
correlate (cor) calculates correlations:
cor weight length mpg
If you need covariances instead, add the cov option:
cor weight length mpg, cov
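Note that correlate drops any observation that is missing a value for any variable in the list. If that costs you too many observations, pwcorr is one alternative: it computes each correlation from all the observations available for that pair. A sketch assuming the auto data, with rep78 included because it has missing values:

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
* correlate uses only cars with no missing values in any listed variable
correlate rep78 mpg weight
* pwcorr uses every available pair; sig adds p-values, obs adds pair counts
pwcorr rep78 mpg weight, sig obs
```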
ttest tests hypotheses about means. To test whether the mean of a variable is equal to a given number, type ttest var==number:
ttest mpg==20
To test whether two variables have the same mean, type ttest var1==var2:
ttest mpg==weight
To test whether two subsamples of your data have the same mean for a given variable, use the by() option:
ttest mpg, by(foreign)
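By default the two-sample test assumes both groups have the same variance. If you are not willing to assume that, the unequal option relaxes it. A sketch assuming the auto data, which also shows that the test statistic and p-value are stored in r():

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
* unequal drops the equal-variance assumption
ttest mpg, by(foreign) unequal
* The t statistic and two-sided p-value are stored for later use
display "t = " r(t) ", two-sided p = " r(p)
```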
Stata has many, many commands for doing various kinds of regressions, but its developers worked hard to make them all as similar as possible. Thus if you can do a simple linear regression you can do all sorts of more complex models.
The regress (reg) command does linear regression. It always needs a varlist, and it uses it in a particular way: the first variable is the dependent variable, and it is regressed on all the others in the list plus a constant (unless you add the noconstant option).
Let's estimate how much consumers were willing to pay for good gas mileage in 1978 using a naive hedonic pricing model. Whether a car is foreign or domestic seems to be important, so throw that in too. Type:
regress price mpg foreign
This regresses price on mpg and foreign. The results suggest that American consumers disliked fuel efficiency, and would pay to avoid it!
Like any good researcher, when our empirical results contradict our theory we first look for better empirical results. We may have some omitted variable bias here; in particular it's probably important to control for the size of the car. Looking over the variables we see several related to size. You could include them all, but they're probably highly correlated and you don't want to introduce collinearity. Check using the correlate (cor) command.
cor weight length displacement trunk headroom
While all the variables are positively correlated, weight, trunk, and headroom aren't too bad so go ahead and add all three:
reg price mpg foreign weight trunk headroom
Now mpg is insignificant but weight is highly significant. Looks like Americans liked big cars and didn't care about fuel efficiency. That I'll believe.
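Pairwise correlations are only a rough check for collinearity. One common follow-up, sketched here under the assumption that you are working with the auto data, is to ask for variance inflation factors after the regression. A frequently quoted rule of thumb treats values above 10 as a warning sign, but that is a convention rather than a test.

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
regress price mpg foreign weight trunk headroom
* Variance inflation factor for each regressor in the model just fit
estat vif
```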
Logistic regression is just as easy, but we need a binary dependent variable. Make an indicator variable goodRep which is one for cars with rep78 greater than three (and missing if rep78 is missing):
gen goodRep=(rep78>3) if rep78<.
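The if rep78<. condition matters because Stata treats missing values as larger than any number, so without it the five cars with missing rep78 would be coded as having a good repair record. A quick way to check the new variable, assuming the auto data:

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
gen goodRep=(rep78>3) if rep78<.
* Each value of rep78, including missing, should map to one value of goodRep
tab rep78 goodRep, missing
* The cars with missing rep78 should have missing goodRep
count if missing(goodRep)
```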
Now let's examine what predicts a car's repair record. We'll include mpg, displacement and gear_ratio because they're the only technical data we have about the car's engine (the most likely thing to break), weight as a measure of load on the engine, and price and foreign just because they seem to be important characteristics of a car.
There are two different commands for running logistic models. They do the same thing, but report their results differently.
logit reports coefficients:
logit goodRep mpg displacement gear_ratio weight price foreign
while logistic reports odds ratios:
logistic goodRep mpg displacement gear_ratio weight price foreign
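The two sets of results are linked: an odds ratio is the exponential of the corresponding logit coefficient. The following sketch, which assumes the auto data and rebuilds goodRep so it runs on its own, shows the relationship and the or option that makes logit report odds ratios too.

```stata
* Reload the example data and rebuild the indicator
sysuse auto, clear
gen goodRep=(rep78>3) if rep78<.
logit goodRep mpg displacement gear_ratio weight price foreign
* Exponentiating a logit coefficient gives the odds ratio for that variable
display exp(_b[mpg])
* With the or option, logit reports odds ratios like logistic does
logit goodRep mpg displacement gear_ratio weight price foreign, or
```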
In collapsing the five point rep78 scale to the indicator goodRep we threw away a lot of information. But we can work with all the values of rep78 in a multinomial logistic model, run using mlogit:
mlogit rep78 mpg displacement gear_ratio weight price foreign
This tells us how each variable affects the probability of a car being in a given rep78 category as opposed to the base category. (Stata chose three as the base category because it is the most common.) However, we shouldn't take the model too seriously for rep78==1 or rep78==2 because foreign is one of the most important variables and no foreign cars have a rep78 of one or two. To exclude them, add an if condition:
mlogit rep78 mpg displacement gear_ratio weight price foreign if rep78>=3
Note that different variables are significant for rep78==4 than for rep78==5. Our logit model couldn't see those differences.
Multinomial logit is a complex model: difficult to compute and difficult to interpret. But Stata makes running it very easy.
Once you've run a model, Stata stores vital information about the model for use in later commands (in the e() stored results; see Programming in Stata if you're interested). This allows you to run a variety of postestimation commands. Again, we can only give you a sampling.
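To see what a model leaves behind, you can list the stored results yourself. A minimal sketch, assuming the auto data and using the first regression from this article as the example model:

```stata
* Reload the example data so the sketch is self-contained
sysuse auto, clear
regress price mpg foreign
* ereturn list shows the scalars, macros and matrices stored in e()
ereturn list
* Individual results can be used directly
display e(N)
display e(r2)
* The coefficient vector, and one coefficient picked out of it
matrix list e(b)
display _b[mpg]
```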
Before beginning, let's run a simpler model:
reg mpg c.weight##c.weight i.foreign displacement gear_ratio
This regresses mpg on weight, weight squared, foreign, displacement and gear_ratio. While we don't need to specify that foreign is a factor variable for the regression to work, we will need it later. Note that the coefficient on foreign is negative. In the raw data domestic cars have the lower mean mpg, but that is due to their greater weight: once weight is held constant, the model predicts lower mpg for foreign cars.
The test command tests hypotheses about the model. The syntax is just test plus a list of hypotheses, which are tested jointly. In setting up hypotheses, the name of a variable is taken to mean the coefficient on that variable.
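test 1.foreign==-3
tests whether the coefficient on foreign is -3. (Because foreign entered the model as the factor variable i.foreign, its coefficient is named 1.foreign.) If you list a variable all by itself it is assumed that you want to test whether its coefficient is zero: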
test displacement gear_ratio
tests the hypothesis that the coefficients on displacement and gear_ratio are jointly zero.
You can have variables on both sides of the equals sign:
test weight==displacement
This is equivalent to:
test weight-displacement==0
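Two related tasks come up often: testing an interaction or squared term, and getting an estimate (not just a test) of a combination of coefficients. A hedged sketch, assuming the auto data and the model above:

```stata
* Reload the example data and refit the model
sysuse auto, clear
reg mpg c.weight##c.weight i.foreign displacement gear_ratio
* Test whether the squared weight term is needed
test c.weight#c.weight
* lincom reports the estimate and confidence interval of a linear combination
lincom weight - displacement
```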
The predict command puts the model's predicted values in a variable:
predict mpghat
("hat" refers to the circumflex commonly used to denote estimated values). You can calculate the residuals with:
gen res1=mpg-mpghat
or you can let predict do it for you:
predict res2, residuals
The result will be the same except for round-off error (since predict uses double precision internally it will have less round-off error, but unless your data have seven digits of precision it doesn't matter).
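If you'd like to see how small the difference is, the following sketch compares the two sets of residuals. It assumes the auto data and refits the model so that it runs on its own; diff is just an illustrative variable name.

```stata
* Reload the example data and refit the model
sysuse auto, clear
reg mpg c.weight##c.weight i.foreign displacement gear_ratio
predict mpghat
gen res1=mpg-mpghat
predict res2, residuals
* The largest absolute difference should be tiny
gen diff=abs(res1-res2)
summarize diff
```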
You can change your data between running the model and making the predictions, which means you can look at counterfactual scenarios like "What if all the domestic cars were foreign?" See Making Predictions with Counter-Factual Data in Stata for some examples.
On the other hand, all the examples in the above article could be done more easily with the new margins command. It's designed to let you examine the influence of variables in your model. Try:
margins foreign
This sets foreign to zero for all cars, leaving the other variables unchanged, finds the predicted mpg for each car, and then averages them. It then sets foreign to one for all cars and repeats the process. Thus if all cars were domestic, our model predicts that the mean mpg would be 22.18, while if all cars were foreign the mean mpg would be 19.22.
An alternative approach is to set all variables to their means other than foreign. You can do this with the atmeans option:
margins foreign, atmeans
The foreign variable can only take on two values (Stata knows this because we marked it as i.foreign in the original regression) so the margins command calculated its results for both of them. Obviously we can't look at all possible values for continuous variables, so for continuous variables we have to specify the values we're interested in with the at() option:
margins, at(displacement=200)
This tells us if all cars had a displacement of 200, the mean predicted mpg would be 21.28. If you want to look at multiple values, list them in parentheses. You can also specify a range as a numlist: 200/300 means every integer from 200 to 300, while 200(50)300 means 200, 250 and 300.
margins, at(displacement=(200 300))
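With more than a couple of values the table of margins gets hard to read, and a graph may serve better. The sketch below assumes the auto data and the model above, and also assumes Stata 12 or later, where marginsplot is available to graph the results of the most recent margins command.

```stata
* Reload the example data and refit the model
sysuse auto, clear
reg mpg c.weight##c.weight i.foreign displacement gear_ratio
* Mean predicted mpg at displacements of 100, 200, 300 and 400
margins, at(displacement=(100(100)400))
* Graph those predictions with their confidence intervals
marginsplot
```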
If you want to look at the marginal effect of a variable, or the derivative of the outcome with respect to that variable, use the dydx() option.
margins, dydx(displacement)
In this simple case, the derivative is just the coefficient on displacement. But recall that we included both weight and weight squared. Thus:
margins, dydx(weight)
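is much more useful. Keep in mind that the result is a function of weight. The number above is the average of the derivative over the observed values of weight (which, because the derivative of a quadratic is linear in weight, equals the derivative at the mean weight), but you might also want to try different values of weight with at().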
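For example, the following sketch evaluates the marginal effect of weight for light, medium and heavy cars. It assumes the auto data and the model above; the three weights are arbitrary values chosen from within the range of the data.

```stata
* Reload the example data and refit the model
sysuse auto, clear
reg mpg c.weight##c.weight i.foreign displacement gear_ratio
* Marginal effect of weight on mpg at 2,000, 3,000 and 4,000 pounds
margins, dydx(weight) at(weight=(2000 3000 4000))
```

Because the model includes weight squared, the three effects differ; with a purely linear term they would all equal the coefficient.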
Once you get into nonlinear models like logit or mlogit the value of margins with dydx() is even greater.
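As a rough illustration, here is the earlier logit model with margins applied to it. The sketch assumes the auto data, rebuilds goodRep, and enters foreign as i.foreign so margins knows it is categorical. After logit, margins works on the probability scale by default, which is usually easier to interpret than coefficients or odds ratios.

```stata
* Reload the example data and rebuild the indicator
sysuse auto, clear
gen goodRep=(rep78>3) if rep78<.
logit goodRep mpg displacement gear_ratio weight price i.foreign
* Average change in the predicted probability of a good repair record per mpg
margins, dydx(mpg)
* Average predicted probability if all cars were domestic, then all foreign
margins foreign
```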
Last Revised: 10/6/2009
