Stata for Researchers: Usage and Syntax

This is part two of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

This article will introduce Stata's user interface and teach you its basic syntax. Understanding Stata's syntax is the key to becoming an expert Stata user.

Running Stata

To start Stata on Winstat or another Windows computer, click the Windows logo button, All Programs, Stata 12 and then Stata MP 12. On Linstat, type xstata.

Stata's User Interface

When you start up Stata, the first thing you'll see is the main user interface window.

Stata's Main Window

Most of it is self-explanatory. A few tricks that will make you more efficient:

  • Press Page Up to retrieve your last command; press it more than once to retrieve earlier commands
  • Click on a command in the Review window and it will be pasted into the Command window for editing. Double click on a command and it will be executed again
  • Click on a variable name in the Variables window and it will be pasted into the Command window at the current location of the cursor
  • Press q or click on the circled-X button at the top to interrupt a command in progress (the button turns red when something is running)
  • Use the Properties window to learn about your data set, the individual variables it contains, and how much memory Stata is using

Getting Started

Start up Stata on the computer you're using. You should be seeing the graphical user interface just like the picture above.

Finding and Opening Data

Stata always has exactly one data set in memory, and the commands you issue will act on it. To open a data set you may be tempted to click File, Open, but you can't write that into a do file. Thus we'll stick to typing commands.

Stata always keeps track of what it calls the "current working directory." Think of it as where you "are" in your computer's file system. Any commands that work with files will assume that the files are in the current working directory unless you specify otherwise. You set the current working directory with the cd (change directory) command. The syntax is simply:

cd directory

with one caveat: if the directory name has any spaces in it the whole thing must go in quotes so Stata understands it's just one entity.

Stata Corp. thoughtfully included a sample data set with the Stata program and we'll use it extensively. Start by using cd to navigate to the directory where Stata is installed. On Winstat, that's c:\program files (x86)\stata12. On other Windows computers, it's typically c:\program files\stata12. Because there's a space between "program" and "files," the directory must go in quotes:

cd "c:\program files (x86)\stata12"

On Linstat, Stata is installed in /usr/local/stata, so you'd type:

cd /usr/local/stata

Since there are no spaces there's no need for quotes. (This is the only command in this entire article which is different when using Linux Stata rather than Windows Stata.)

Next see what's here using the ls (list) command. Type:

ls

The file you want is called auto.dta (.dta is the standard extension for Stata data sets). To load it type:

use auto

You don't have to type the .dta; Stata will assume it.

You can specify the full location of a data set in the use command, e.g.:

use "c:\program files (x86)\stata12\auto"

With this method you don't need to use cd to set the current directory, but you have to specify the full location of every file you work with.

If you had trouble locating Stata on your computer, the sysuse command loads data from wherever Stata is installed and works on all systems. Just type:

sysuse auto

(You'll never use sysuse for real work.)
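One detail worth knowing before you load data repeatedly: if the data set in memory has unsaved changes, use and sysuse will refuse to replace it. The following is a minimal sketch, assuming you don't need to keep whatever is currently in memory. The clear option discards it before loading.

```stata
* Load the example data, discarding any data set currently in memory
sysuse auto, clear

* Confirm what was loaded: number of observations and variables
describe, short
```

The same option works with use, e.g. use auto, clear.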

Stata can open a data set from the web as easily as from your local hard drive. You can get this exact same data set from Stata's web site by typing:

use http://www.stata-press.com/data/r12/auto.dta

Our suggestion is that you make a directory for each project you're working on, and keep all the data files, do files, log files and any other files related to the project in its directory. When you start Stata, cd to the directory of the project you want to work on and you can refer to its files simply by name from then on. Since your current "project" is the Stata for Researchers series, set the current working directory to wherever you put the example files. If you put them in U:\StataResearch, the command would be:

cd U:\StataResearch

The 'auto' Example Data Set

This data set of 1978 cars has been included with Stata for years, and is thus available to all Stata users. It's frequently used for examples throughout the Stata world, just as we'll use it today. To see what's in it, type:

browse

or click the fourth button from the right in the toolbar at the top. This opens Stata's Data Editor, which shows you your data set in a spreadsheet-like form. You can also invoke the Data Editor by typing edit or clicking the fifth button from the right, and then it will allow you to make changes. Since you should almost never change your data interactively, get in the habit of using browse so you don't make changes by accident. Before proceeding there are a couple of things you should note about this data set.

Numbers vs. Strings

Most of the variables in this data set are numbers, like price and mpg. The variable make contains words or, as Stata calls them, "strings" (as in strings of characters). Obviously you can't do math or statistics with words, but Stata can do many other useful things with string variables. Note that strings can contain numbers: you can put "123" in a string, but Stata will think of it as the character '1' followed by the character '2' followed by the character '3', not the value one hundred twenty-three. We'll talk more about strings in Working with Data.

Missing Values

Several cars have dots in the rep78 column rather than numbers. These indicate missing values. A Stata data set is a rectangular matrix, so every observation must have something for every variable. If no real data are available, Stata stores a special code for "missing."

Missing values often require special handling, and it's easy to write code which works fine with complete data but gives wrong answers if there are any missing values. We'll discuss how to handle missing values properly as we go, but the first step is to be aware of them. The misstable summarize command will tell you which variables have missing values and how many.

While this data set just uses "generic" missing values, there are 26 others you can use: .a through .z. Stata treats them all the same, but you can assign meanings to them. For example, if you were working with a survey you might decide to code "the question did not apply" as .a and "the respondent refused to answer" as .b.
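As a quick illustration of how the generic and extended missing values relate, here is a small sketch using the auto data. It relies on the missing() function and on if conditions, both of which are covered later in this article.

```stata
sysuse auto, clear

* Count the observations where rep78 is missing (., .a, ..., or .z)
count if missing(rep78)

* The extended missing values are ordered after the generic one,
* so this displays 1 (true)
display (. < .a) & (.a < .z)
```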

Value Labels

The foreign variable appears to contain text, like make. But note that it's a different color, and if you click on a cell in that column what appears at the top of the window is a 0 or a 1. This tells you foreign is really a numeric variable with a set of value labels applied. You can see the real values by typing:

browse, nolabel

As you see, this set of value labels associates the number 0 with the word Domestic and the number 1 with the word Foreign. We'll talk about creating value labels in Working with Data. But for now, the important thing to remember is that if you write code referring to the foreign variable, it must use the real values 0 and 1, not the labels Domestic and Foreign. This frequently confuses people who are starting out with a new data set, so it's worth checking to see which variables have associated value labels. In addition to opening the data browser and looking for columns in blue, you can look at the Properties window (after selecting the variable of interest in the Variables window) or type:

describe

(or just d). This will give you information about all the variables in your data set, including a column that tells you which ones have value labels. You can get the same information and more by opening the Variables Manager window (third button from the right or type varman). You can see what the labels are by typing:

label list
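To make the difference between values and labels concrete, here is a short sketch using the auto data. It uses tabulate and an if condition (if is covered later in this article), and the commented-out line is one you would expect to fail.

```stata
sysuse auto, clear

* Show the labels, then the underlying numeric values
tabulate foreign
tabulate foreign, nolabel

* Conditions must use the real values, not the labels
count if foreign == 1

* This would stop with a type mismatch error, because foreign is numeric:
* count if foreign == "Foreign"
```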

Syntax Elements

Almost all Stata commands rely on the same set of syntax elements. These elements give you a tremendous amount of control over the commands you run. Very complicated instructions can be expressed relatively simply by the proper combination of syntax elements.

We'll discuss five elements:

  • options
  • varlists
  • if
  • in
  • by

These elements always go in the same place in a command, though no command will use all of them:

[by:] command [varlist] [if/in] [, options]

In order to see these elements in action, we'll use a very simple command:

list

(It can also be abbreviated as just l.) This prints your data on the screen--think of it as a browse you can use in do files. However, list all by itself produces so much output that it's hard to find what you want. By learning these syntax elements you'll learn to list just the information you want--and in the process learn to control what any Stata command does.

Options

You've already seen one option: nolabel. There are many commands that take the nolabel option, while some options only apply to one command. Estimators in particular tend to have lots of options for controlling exactly how the estimation is carried out.

The syntax for using options is very simple: they go at the end of the command, following a comma. There can be any number of options (all following a single comma) and they can go in any order. If an option needs more information, like a variable to act on, it goes in parentheses immediately following the option.

For example, adding the mean option to list causes it to add a line to the end of the list giving the means of the (numeric) variables.

l, mean

But if you only wanted to see the means of some variables, you'd list them in parentheses right after mean:

l, mean(mpg)
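The rules for options can be seen with other commands too. This is a small sketch, not an exhaustive tour: it uses the missing and nolabel options of tabulate and the sepby() option of list, chosen only because they are easy to see in the output.

```stata
sysuse auto, clear

* Two options after a single comma, in either order: same result
tabulate foreign, missing nolabel
tabulate foreign, nolabel missing

* An option that needs more information takes it in parentheses:
* here, draw a separator line whenever foreign changes
list make price mpg, sepby(foreign)
```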

Varlists

A varlist is simply a list of variables separated by spaces. If you give a command a varlist the command will be executed just for the variables in the varlist. Type:

l make

This lists just the make of each car.

As the name suggests, a varlist can include multiple variables. Try typing:

l make price mpg

Varlists can get quite long, so there are several shortcuts for writing them. If you put a dash between two variables, all the variables between them (as defined by the order they're listed in the Variables window) will be included in the variable list. Thus:

l make-mpg

includes price, because the first three variables in the data set are make, price and mpg.

You can also use wildcard characters. A * matches any number of characters, so

l m*

gives you both make and mpg. It would also include a variable just called m if there were one in the data set. A ? matches any one character, but it must be exactly one. Thus:

l x?

would list--if our data set had such variables--x1, x2, and x3, but not x, x10 or xenophobia.

Wildcards can go in any location. For example,

l *t

lists all variables that end in t (weight and displacement) while

l t*n*

lists all variables that start with t and then have an n in any other position (trunk and turn).

You can mix shortcut types:

l m* weight-displacement *n
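When you're unsure what a varlist shortcut will match, you can check it without printing any data. A sketch of one way to do that, using describe (which accepts a varlist) and ds (which prints just the matching variable names):

```stata
sysuse auto, clear

* describe shows exactly which variables a varlist expands to
describe make-mpg
describe m*
describe *t

* ds prints only the names that match
ds t*n*
```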

Exercises

  1. What is the most concise varlist that would include all the variables in this data set? (Solution)
  2. Suppose your data set consisted of variables x1 through x20 plus one called xenophobia. What varlist would select all but xenophobia? (Solution)

If

While a varlist says "only act on the variables (columns) in this list" an if condition says "only act on the observations (rows) where this condition is true." Thus what follows must be a condition that is either true or false. Type:

l make mpg if mpg==25

This gives you a list containing the make and mpg of just those cars which get exactly 25 miles per gallon. Note that a command can have both a varlist and an if condition, but the varlist comes first.

Make sure you type two equals signs. Stata, like most computer languages, understands two different meanings for "equals." One equals sign is for assignment: mpg=25 means "Make mpg 25." Two equals signs are for testing: mpg==25 asks "Is mpg 25 or not?" This will drive you crazy for about a week and then become second nature. The following are the "relational operators" used to form conditions:

== Equal

> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
!= Not equals

The exclamation point also means "not" more generally. For example, try:

l make mpg if !(mpg>25)

"Not" can be thought of as reversing the condition that follows it: changing false to true and true to false.

Combining Conditions

You can combine two conditions with "logical and" (&) or "logical or" (|, called the "pipe" and created by pressing Shift-Backslash). With "logical and" the result is true if and only if both conditions are true, while with "logical or" the result is true if either condition is true, or both. In terms of set theory, "logical and" is the intersection, while "logical or" is the union.

Thus:

list make price mpg if mpg>25 & price<5000

will give you a list of cars that both get good gas mileage and are relatively cheap (remember these are 1978 dollars), while:

list make price mpg if mpg>25 | price<5000

will give you a list of cars that either get good gas mileage or are relatively cheap or both, a much larger list.

Once you get past two conditions, the order in which they're evaluated can change the result. There are set precedence rules, but we suggest using parentheses liberally to ensure Stata will evaluate conditions in the order you think it should.
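To see why parentheses matter, consider this sketch with three conditions. In Stata's precedence rules & is evaluated before |, so the first two commands describe the same set of cars, while the third describes a different one. The particular conditions are just for illustration.

```stata
sysuse auto, clear

* & is evaluated before |, so these two conditions are equivalent
count if foreign == 1 | mpg > 25 & price < 5000
count if foreign == 1 | (mpg > 25 & price < 5000)

* Parentheses force the "or" to be evaluated first: a different condition
count if (foreign == 1 | mpg > 25) & price < 5000
```

Writing the second form rather than the first costs two characters and removes any doubt about what you meant.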

Indicator Variables

While we've talked about conditions being true or false, in reality Stata uses numbers: one is true and zero is false. You can take advantage of this to write very natural if conditions with indicator variables:

l make if foreign

Part of what makes this work is the variable name: the variable foreign tells us whether or not a car is in fact foreign. Now consider a variable called gender: does a one indicate that a person is male or that a person is female? You can't tell without checking your codebook (or value labels). But if the variable were called female it would be obvious that a one means this person is female.

More generally, any number other than zero is considered true--even missing. Thus:

l make if rep78

will list all the observations in the data set, including those with a missing value of rep78, because rep78 is never zero. It's rare that you'd want to use anything but an indicator variable in this way, but it's a common mistake to forget about missing values of indicator variables. If foreign were missing for some cars, if foreign would treat those cars as if they were known to be foreign.
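You can confirm this with count, which takes the same if conditions as list but just reports how many observations qualify. A minimal sketch:

```stata
sysuse auto, clear

* rep78 is never zero, so "if rep78" is true for every observation,
* including those where rep78 is missing
count if rep78
count

* Comparing explicitly to 1 is safe even if an indicator has missing values
count if foreign == 1
```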

Missing Values and Inequalities

When it comes to if conditions with inequalities, missing values are treated like very large numbers--essentially infinity. (In fact missing values are stored as the largest possible numbers given the type of variable.) Consider making a list of cars with "good" repair records, defined as rep78 greater than three:

l make rep78 if rep78>3

Cars with a missing value for rep78 are included, because infinity is much greater than three.

Whether that's a problem or not depends on what you wanted. Did you want a list of cars which are not known to have poor repair records, or a list of cars which are known to have good repair records? Assuming it's the latter, the solution is to add a second condition:

l make rep78 if rep78>3 & rep78!=.

Now cars with missing repair records are specifically excluded. However, note that this would not exclude the other missing values: .a, .b, etc. These are stored as even larger numbers, so you can exclude all missing values with:

l make rep78 if rep78>3 & rep78<.

An alternative that's longer to type but easier to read is:

l make rep78 if rep78>3 & !missing(rep78)

The missing function takes one or more arguments, usually variables, and returns true if any of them is missing and false otherwise. The exclamation point then reverses it, making it "not missing."
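Counting rather than listing makes the effect of each condition easy to compare. The sketch below uses count with the same conditions as above.

```stata
sysuse auto, clear

* Includes cars with missing rep78, because missing counts as very large
count if rep78 > 3

* Either of these excludes all missing values
count if rep78 > 3 & rep78 < .
count if rep78 > 3 & !missing(rep78)

* The first count exceeds the other two by the number of cars
* with rep78 missing
count if missing(rep78)
```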

Since the "missing is infinity" rule is not intuitive, it's easy to forget. But the results can be disastrous. Consider trying to identify senior citizens with the condition if age>=65: anyone whose age is missing would be called a senior citizen. Different rules for handling missing values have been proposed, but they just make the disasters occur under different circumstances. There's no alternative to checking for missing values and handling them properly where they exist.

Fortunately the checking is easy to do. Just type:

misstable sum

(sum is short for summarize). Since rep78 is the only variable listed, you now know that you have to worry about missing values when working with rep78, but can ignore them with all the other variables--as long as your data don't change.

If you will be working with many data sets or data sets that do change, you can build a test for missing values right into your do files. The assert command checks to see whether a given condition is true or not for all observations. For example:

assert mpg<.

checks to see if mpg is always non-missing. Since it is, nothing happens. However, try:

assert rep78<.

Now you get an error message (along with information about how often the condition is violated). If you had been running a do file it would have come to a screeching halt. This is good: if you had written code on the assumption that rep78 is never missing, it's far better for that code to crash than to continue running and give you wrong answers.

assert is useful for far more than checking for missing values. With clever programming you can use it to check all sorts of assumptions about your data ("Each value of ID is associated with just one person", "Every household has a head of household", etc.). Doing so can save you a lot of headaches.
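Here is a sketch of a few such checks on the auto data. The specific assumptions tested are only examples. isid is a separate command that checks whether a variable uniquely identifies observations, and capture lets a do file examine the return code of a failed check instead of stopping.

```stata
sysuse auto, clear

* make uniquely identifies the observations (an error if it does not)
isid make

* foreign only takes the values 0 and 1, so it is never missing either
assert inlist(foreign, 0, 1)

* price is positive and never missing
assert price > 0 & price < .

* capture suppresses the error; _rc is 0 if the assertion held
* and nonzero if it failed
capture assert rep78 < .
display _rc
```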

Exercises

  1. A shipping company would like to ship cars from other countries to the US. Its ships can handle cargoes up to 200 inches in length and weighing up to 4,000 pounds. Doing so is only profitable for cars costing at least $5,000. Which cars should it consider shipping? (Solution)
  2. Suppose I want to buy a car. I'm willing to pay up to $4,000 for most cars, but I'll go up to $5,000 if the car is known to be reliable (rep78>3) and gets good gas mileage (mpg>25). Which cars should I look at? (Solution)
  3. Recall that we earlier typed l make mpg if !(mpg>25). Now try l make mpg if !mpg>25. What's going on? (Solution)

In

Like if, in allows you to specify which observations are acted on. But with in you just specify them by observation number.

l make in 1

lists the make of the first observation. If you want to list a range of observations, type the first number, followed by a slash, followed by the last number:

l make in 1/3

Negative numbers are understood as counting from the end of the data set rather than the beginning. Thus:

l make in -3/-1

lists the last three cars in the data set.

You'll almost certainly use if far more than in. But in is very handy for checking your work. After running a do file that manipulates your data, it's a good idea to spot check the results by typing:

l in 10
l in 20

etc., then making sure the results are correct.
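Two further conveniences, shown here as a brief sketch: in a range, f stands for the first observation and l for the last, and in works with commands other than list.

```stata
sysuse auto, clear

* f and l stand for the first and last observation
list make in f/3
list make in -3/l

* in is not specific to list: summarize mpg for the first ten observations
summarize mpg in 1/10
```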

Exercise

  1. Observation numbers are more useful if the order of observations means something, so type sort mpg. Now they're listed in ascending order by mpg. What are the ten most fuel efficient cars? When you're done type sort foreign so the data are again sorted by foreign, as they were originally (cars with the same value of foreign may not end up in exactly their original order). (Solution)

By

by is used to run a command separately across groups. For example, list the domestic cars and the foreign cars separately by typing:

by foreign: l make foreign

Note how the list is broken into two parts. The first one says foreign=Domestic at the top, the second says foreign=Foreign. by splits the data set up into groups, one group for each unique value of the by variable, then executes the command for each group independent of the others.

Since by takes a varlist, you can work by more than one variable at a time. Try:

by foreign rep78: l make

You'll get the message:

not sorted

Stata can only use by if the data set is sorted in the order of the by variables. This data set started out sorted by foreign, but not by rep78. Type:

sort foreign rep78

and Stata will sort the data and allow you to execute:

by foreign rep78: l make

successfully. As you can see it breaks the data set into one group for each unique combination of foreign and rep78 and then carries out the command. Note that some combinations (foreign cars with a rep78 of one, for example) do not occur in the data and are not listed.

Users got tired of forgetting to sort before using by, so Stata added bysort:

bysort foreign rep78: l make

This will first sort the data by foreign and rep78, then carry out the rest of the command.
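by is not limited to list. The sketch below runs summarize and count separately for domestic and foreign cars, and shows that by with the sort option is another way of writing bysort. The choice of mpg and the 25 mpg cutoff is just for illustration.

```stata
sysuse auto, clear

* Summary statistics for mpg, separately for domestic and foreign cars
bysort foreign: summarize mpg

* The same thing written as by with the sort option
by foreign, sort: summarize mpg

* by combines with if: count cars over 25 mpg within each group
bysort foreign: count if mpg > 25
```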

Exercise

  1. make appears to be a unique identifier in this data set (i.e. each car has a unique value of make). If so, what would it look like if you did a list "by make"? Do the list; is it what you expected? What would it look like if make were not in fact a unique identifier? (Solution)

We'll do much more with by in Working with Groups.

Next: Working with Data

Previous: Introduction

 

Last Revised: 8/19/2011