Introduction to Stata: Statistics

This is part five of Introduction to Stata. If you're new to Stata we highly recommend starting from the beginning.


In this section we'll discuss two of the most basic and useful statistical commands. You can do a great deal of valuable work with these commands, but our primary goal will be to help you understand how the syntax elements you learned earlier can be combined with statistical commands to do analysis.

Start by creating a do file that loads the auto data set:

capture log close
log using stats.log, replace

clear all
sysuse auto

Summary Statistics for Continuous Variables

summarize (sum) gives you summary statistics that will help you understand the distribution of continuous (quantitative) variables. With just:

sum

you will get basic summary statistics for all the variables in your data set. Run the do file by pressing Ctrl-d or clicking the "play" button at the top right, and take a look at the output. Note that there is nothing for make: it is a string variable, so summary statistics don't make sense. Also note that for rep78 the number of observations is 69 rather than 74. That's because five missing values were ignored and the summary statistics were calculated over the remaining 69 values of rep78. Most statistical commands take a similar approach to missing values and that's usually what you want, so you rarely have to include special handling for missing values in statistical commands.

On the other hand, rep78 is a categorical variable, so these summary statistics don't make a lot of sense for it.
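If you want to confirm how the missing values of rep78 were handled, here is a minimal sketch. It relies on summarize leaving the number of observations it actually used in the stored result r(N); the sysuse line is only needed if you run the sketch on its own rather than inside the do file.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* Count the observations where rep78 is missing
count if missing(rep78)

* summarize skips those observations; r(N) is the number it used
quietly sum rep78
display "Observations used for rep78: " r(N)
```

The two counts should add up to the total number of observations in the data set.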

All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a varlist:

sum mpg

If you want summary statistics for just the foreign cars, add an if condition:

sum mpg if foreign
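A short sketch of how that condition is evaluated, assuming the coding of foreign in the auto data (0 for domestic, 1 for foreign, no missing values). Stata treats any nonzero value as true, so if foreign selects the same cars as if foreign == 1 here. Be careful with this shorthand on variables that contain missing values, because missing also counts as nonzero.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* Same result as "sum mpg if foreign", with the comparison written out
sum mpg if foreign == 1

* The complementary group: domestic cars
sum mpg if !foreign
```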

If you want summary statistics of mpg for both foreign and domestic cars, calculated separately, use by:

by foreign: sum mpg

This is one way to compare the two groups.
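The by prefix only works when the data are already sorted by the by variable. As a sketch of what to do when they are not, bysort sorts the data first and then runs the command for each group. The sort mpg line is there only to scramble the order for illustration.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* Put the data in a different order so a plain "by foreign:" would fail
sort mpg

* bysort sorts by foreign, then runs sum once per group
bysort foreign: sum mpg
```

Note that bysort changes the order of the observations in memory, which matters if later commands depend on a particular sort order.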

The detail (d) option will give more information. Try:

sum mpg, detail

Run your do file and examine the results.
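The numbers sum reports are also available for use in later commands as stored results. The following is a minimal sketch, not something the rest of this section depends on: return list shows everything the last command stored, and r(p25), r(p50), and r(p75) are the quartiles stored by the detail option.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* Run sum without printing its table, then list what it stored
quietly sum mpg, detail
return list

* Use individual stored results, for example the median and the quartiles
display "Median mpg: " r(p50)
display "Interquartile range: " r(p75) - r(p25)
```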

Exercise: Find the mean price of cars that get more than 25 miles per gallon. Now compare that with the mean price of cars that get 25 miles per gallon or less. Does this mean American consumers in 1978 considered high gas mileage a negative characteristic?

Frequencies for Categorical Variables

tabulate (tab) will create tables of frequencies, which will help you understand the distribution of categorical variables. It can also be useful for string variables that describe categories or groups.

If you give tab a variable list with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table (i.e. crosstabs). To get an idea of what tab does, add the following to your do file and run it:

tab rep78
tab rep78 foreign

Tables are usually easier to read if the variable with more distinct values comes first, so that its values are listed vertically.

The tab command won't accept more than two variables, but you can bring in a third by focusing on one group at a time with an if condition:

tab headroom rep78 if foreign

You can also use by, but the results are often hard to read.

The tab command has a rich set of useful options. The missing values of rep78 were not included in the table, which makes it easy to forget they're there. Add them with the missing option:

tab rep78, missing

By default tab will show value labels, but you can override this with the nolabel option. A quick and easy way to find the values underneath the value labels is to run two tab commands, one without nolabel and one with it:

tab foreign
tab foreign, nolabel

The label list command will give you that information in a form that's much easier to read, unless your data set has a large number of labels (as many do). Then you'll want to limit the list to just the label that's been applied to the variable you want. The value label applied to foreign is called origin, so you'd run label list origin. The advantage of tab with nolabel is that you don't need to know the label name.
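If you don't know which value label is attached to a variable, one way to find out is sketched below. The describe command shows the label name in its "value label" column, and you can then pass that name to label list.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* The "value label" column shows the label attached to foreign
describe foreign

* List just that label's mapping from values to text
label list origin
```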

Going back to tab, to get percentages in a two-way table add the row, column, or cell options:

tab rep78 foreign, row column cell

For this table, row answers the question "What percentage of cars with a rep78 of one are domestic?" while column answers "What percentage of domestic cars have a rep78 of one?" and cell answers "What percentage of all the cars are both domestic and have a rep78 of one?" Usually you are only interested in one of them.
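When only one kind of percentage matters, the table is easier to read if you ask for just that one. A small sketch: the nofreq option suppresses the raw counts, so each cell shows only its row percentage.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* Row percentages only: within each rep78 category, the split between
* domestic and foreign cars
tab rep78 foreign, row nofreq
```

Each row of this table should sum to 100 percent. Swapping row for column would instead make each column sum to 100.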

tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:

tab foreign, sum(mpg)

There's also a chi2 option that runs a chi-squared test on a two-way table:

tab rep78 foreign, chi2
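The chi-squared approximation becomes less reliable when some cells have small expected counts, which is a concern for a table like this one. As a sketch of how you might check, the expected option prints the expected frequency under each observed count, and the test statistic and p-value are left in r(chi2) and r(p) afterward.

```stata
* Reload the example data (only needed when running this on its own)
sysuse auto, clear

* Show expected frequencies alongside the observed counts
tab rep78 foreign, chi2 expected

* The test results are stored for later use
display "Chi-squared statistic: " r(chi2)
display "p-value: " r(p)
```

If many expected counts are small, tab also has an exact option that runs Fisher's exact test instead.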

Run your do file and examine the results.

Exercise: Use tab to compare the mean value of price associated with each rep78 category. What relationship do you see? How does that relationship change if you examine foreign and domestic cars separately? (Ignore for the moment the small sample sizes for many of the cells.)

This do file does not make any changes to the data set it uses, so there's no need to save a new version of it. But to finish your do file properly, have it close its log:

log close

Now that you understand the basics of how statistical commands work in Stata, learning more of them will be easy.

Next: Creating and Changing Variables

Last Revised: 5/27/2020