Stata for Researchers: Usage and Syntax

This is part two of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

This article will introduce Stata's user interface and teach you its basic syntax. Understanding Stata's syntax is the key to becoming an expert Stata user.

Running Stata

To start Stata on Winstat or another Windows computer, click the Windows logo button, All Programs, Stata 14 and then Stata MP 14. On Linstat, type xstata.

Stata's User Interface

When you start up Stata, the first thing you'll see is the main user interface window.

Most of it is self-explanatory. A few tricks that will make you more efficient:

  • Press Page Up to retrieve your last command; press it more than once to retrieve earlier commands.
  • Click on a command in the Review window and it will be pasted into the Command window for editing. Double click on a command and it will be executed again.
  • If you decide something you did interactively was useful enough to preserve, right-click on the commands in the Review window, choose Send selected to Do-file Editor, edit as needed, and then save.
  • If you start typing the name of a variable and then press Tab, Stata will fill in the rest of the variable name if you've typed in enough for Stata to identify it.
  • Click on a variable name in the Variables window and it will be pasted into the Command window at the current location of the cursor.
  • Press q or click on the circled-X button at the top to interrupt a command in progress (the button turns red when something is running).
  • Use the Properties window to learn about your data set, the individual variables it contains, and how much memory Stata is using. Stata must load your entire data set into memory, and if you try to use more memory than your computer has, Windows will use disk space as memory and Stata will become extremely slow.

Start up Stata on the computer you're using. You should see the main user interface window described above.

Directories and Files

Stata always has exactly one data set in memory, and the commands you issue will act on it. To open a data set you may be tempted to click File, Open, but you can't write that into a do file. Thus we'll stick to typing commands.

Stata always keeps track of what it calls the "current working directory." Think of it as where you "are" in your computer's file system. Any commands that work with files will assume that the files are in the current working directory unless you specify otherwise. You set the current working directory with the cd (change directory) command. The syntax is simply:

cd directory

with one caveat: if the directory name has any spaces in it the entire directory name must go in quotes so Stata understands it's just one entity.

If the directory name starts with a drive letter (or in Linux if it starts with "/") then Stata will go directly to that location no matter what the current working directory is. If you placed the Stata for Researchers example files in U:\StataResearch make that the current working directory by typing:

cd U:\StataResearch

If you placed them in a different directory, type that instead. Don't forget that if you put spaces in the name you need to put quotes around it (e.g. cd "U:\Stata Research"). Note that in Windows, file and directory names are not case sensitive (i.e. you could have typed cd u:\stataresearch), but in Linux they are.

You can also specify directories relative to the current directory. For example, .. means "up one level from the current directory." Assuming you're now in U:\StataResearch, typing:

cd ..

will put you in U:\.

If the directory in your cd command does not start with a drive letter ("/" in Linux) then it is assumed to be inside the current directory. Type:

cd StataResearch

to move from the U:\ drive to U:\StataResearch.
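If you lose track of where you are, the pwd (print working directory) command displays the current working directory. The following is a short sketch of the steps above with pwd added after each move; it assumes the example files are in U:\StataResearch, so substitute your own location if they are not.

```stata
* Assumes the example files are in U:\StataResearch
cd U:\StataResearch
* Display the current working directory
pwd
* Move up one level, then check where we are
cd ..
pwd
* Relative path: move back down into StataResearch
cd StataResearch
pwd
```

Note that comments in commands typed interactively must start the line with *, as they do here.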

Next see what's here using the ls (list) command. Type:

ls

The file you want is called auto.dta (.dta is the standard extension for Stata data sets). To load it type:

use auto

You don't have to type the .dta; Stata will assume it.

You can specify the full location of a data set in the use command, e.g.:

use U:\StataResearch\auto

With this method you don't need to use cd to set the current directory, but you have to specify the full location of every file you work with. You can also use relative paths in commands: if the current working directory had a subdirectory called data you could type use data\mydataset to load a data set inside it.

Our suggestion is that you make a directory for each project you're working on, and keep all the data files, do files, log files and any other files related to the project in its directory. Make that directory the current working directory whenever you're working on that project. If you double-click on a data set or do file in Windows Explorer, Stata will start with the current working directory already set to the location of the file. Then in your programs you can refer to files simply by name without having to specify any locations. This is quicker than typing out the location of each file and reduces opportunities for error. It also makes your project portable: you can move the entire directory to a different computer (even from Windows to Linux) or zip it up and send it to someone else, and all your programs will still work.
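One detail that helps with portability: Stata accepts forward slashes in file paths on Windows as well as on Linux, so a relative path written with forward slashes should work on either system. A minimal sketch, where the data subdirectory and mydataset are the hypothetical names from the earlier example and not files included with this series:

```stata
* Hypothetical layout: the project directory contains a subdirectory
* called data, with a data set called mydataset.dta inside it
cd U:\StataResearch
* The forward slash is understood on both Windows and Linux
use data/mydataset
```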

If you're inclined to create subdirectories to organize your files, consider using Stata's Project Manager instead. The Project Manager allows you to define groups and place files in them without actually changing their location on disk. That way you can reorganize your files without changing the programs that use them. We won't discuss the Project Manager, but it's very easy to use.

The 'auto' Example Data Set

The 'auto' data set contains information about 1978 cars and has been included with Stata for many, many years. Every Stata user has access to it so it is frequently used for examples, just as we'll use it today. To see what's in it, type:

browse

or click the fourth button from the right in the toolbar at the top. This opens Stata's Data Editor, which shows you your data set in a spreadsheet-like form. You can also invoke the Data Editor by typing edit or clicking the fifth button from the right, and then it will allow you to make changes. Since you should never change your data interactively, get in the habit of using browse so you don't make changes by accident. Before proceeding there are a few things you should note about this data set.

Numbers vs. Strings

Most of the variables in this data set are numbers, like price and mpg. The variable make contains words or, as Stata calls them, "strings" (as in strings of characters). Obviously you can't do math with words, but Stata can do many other useful things with string variables.

Missing Values

Several cars have dots in the rep78 column rather than numbers. These indicate missing values. A Stata data set is a rectangular matrix, so every observation must have something for every variable. If no actual data are available, Stata stores a code for "missing."

Missing values often require special handling, and it's easy to write code which works fine with complete data but gives wrong answers if there are any missing values. The worst part about missing values is there's no single right way to deal with them—it depends on what you're trying to do. How to handle missing data will be a recurring theme in this series.

While this data set just uses "generic" missing values, there are 26 others you can use: .a through .z. Stata treats them all the same, but you can assign meanings to them. For example, if you were working with a survey you might decide to code "the question did not apply" as .a and "the respondent refused to answer" as .b.
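As a rough illustration of that survey scenario, the sketch below types in a tiny made-up data set (the variable income and its values are hypothetical) and then counts each kind of missing value. It uses the count command, which reports how many observations satisfy a condition, and the missing() function, which is discussed later in this article.

```stata
* Hypothetical survey data, typed in directly
clear
input income
52000
.a
.b
38000
.
end
* Question did not apply
count if income == .a
* Respondent refused to answer
count if income == .b
* missing() is true for . and for .a through .z alike
count if missing(income)
```

The three counts are 1, 1 and 3.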

Value Labels

The foreign variable appears to contain text, like make. But note that it's a different color, and if you click on a cell in that column what appears at the top of the window is a 0 or a 1. This tells you foreign is really a numeric variable with a set of value labels applied. Comparing the numbers at the top with the words in the table, you'll see that this set of value labels associates the number 0 with the word Domestic and the number 1 with the word Foreign.

We'll talk about creating value labels in Working with Data. But for now, the important thing to remember is that if you write code referring to the foreign variable, it must use the real values 0 and 1, not the labels Domestic and Foreign. This frequently confuses people who are starting out with a new data set, so it's worth checking to see which variables have associated value labels. In addition to opening the data browser and looking for columns in blue, you can look at the Properties window (after selecting the variable of interest in the Variables window) or type:

describe

(or just d). This will give you information about all the variables in your data set, including a column that tells you which ones have value labels. You can get the same information and more by opening the Variables Manager window (third button from the right or type varman). You can see what the labels are by typing:

label list
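To make the point about real values concrete, here is a small sketch. It uses sysuse, which loads the copy of auto.dta that ships with Stata (assumed here to match the copy in the example files), and count, which reports how many observations satisfy a condition; neither command is otherwise covered in this article.

```stata
* Load the copy of auto.dta that ships with Stata
sysuse auto, clear
* Show just the set of labels attached to foreign (it is named origin)
label list origin
* Conditions must use the underlying numbers, not the label text
count if foreign == 1
* The following would fail with a type mismatch error,
* because foreign is numeric and "Foreign" is a string:
* count if foreign == "Foreign"
```

The count is 22, the number of foreign cars in the data set.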

Syntax Elements

Almost all Stata commands rely on the same set of syntax elements. These elements give you a tremendous amount of control over the commands you run. Very complicated instructions can be expressed relatively simply by the proper combination of syntax elements.

We'll discuss four elements:

  • varlists
  • if
  • options
  • by

These elements always go in the same order:

[by varlist:] command [varlist] [if condition] [, options]

In order to see these elements in action, we'll use a very simple command:

list

(list can also be abbreviated as just l.) This prints your data on the screen—think of it as a browse you can use in do files. However, list all by itself produces so much output that it's hard to find what you want. By learning to use these syntax elements you'll learn to list just the information you want—and in the process learn to control what any Stata command does.
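As a preview, here is one list command that uses all four elements at once, with comments mapping each piece to the template above. Each element is explained in the sections that follow; sysuse (which loads the copy of auto.dta shipped with Stata) and the noobs option are not part of the original walkthrough and are used here only for illustration.

```stata
sysuse auto, clear
* by varlist:    bysort foreign:   run separately for domestic and foreign cars
* command        list
* varlist        make price mpg    show only these three variables
* if condition   if mpg > 25       only cars that get more than 25 mpg
* options        , noobs           leave out the observation numbers
bysort foreign: list make price mpg if mpg > 25, noobs
```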

Varlists

Varlists allow you to control which variables (columns) a command will act on. A varlist is simply a list of variables separated by spaces, and it goes right after the command itself. Type:

list make

This lists just the make of each car rather than all the variables.

As the name suggests, a varlist can include multiple variables. Try typing:

list make price mpg

Varlists can get quite long, so there are several shortcuts for writing them. If you put a dash between two variables, all the variables between them (as defined by the order they're listed in the Variables window) will be included in the variable list. Thus:

list make-mpg

includes price, because the first three variables in the data set are make, price and mpg.

You can also use wildcard characters. A * matches any number of characters, so

list m*

gives you both make and mpg. It would also include a variable just called m if there were one in the data set. A ? matches any one character, but it must be exactly one. Thus:

list x?

would list (if our data set had such variables) x1, x2, and x3, but not x, x10 or xenophobia.

Wildcards can go in any location. For example,

list *t

lists all variables that end in t (weight and displacement) while

list t*n*

lists all variables that start with t and then have an n in any other position (trunk and turn).

You can mix shortcut types:

list m* weight-displacement *n
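If you're unsure which variables a shortcut will match, one way to check is the ds command, which displays just the names of the variables in a varlist rather than their contents. ds is not otherwise covered in this article, and sysuse is used here only to load the copy of auto.dta that ships with Stata.

```stata
sysuse auto, clear
* Variables whose names start with m
ds m*
* Variables whose names end in t
ds *t
* Everything from make through mpg, in data set order
ds make-mpg
```

The first command shows make and mpg, the second weight and displacement, and the third make, price and mpg.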

Exercises

  1. What is the most concise varlist that would include all the variables in this data set? (Solution)
  2. Suppose your data set included variables x1 through x20 plus many others, including one called xenophobia. What varlist would select x1 through x20 but not those other variables, in particular not xenophobia? Assume the variables are intermingled so x1-x20 will not work. (Solution)

If

If conditions allow you to control which observations (rows) a command acts on. An if condition consists of the word if followed by some condition that is either true or false. It comes after the varlist if there is one (if not it goes directly after the command). The command will only act on those observations where the condition is true. Type:

list make mpg if mpg==25

This gives you a list containing the make and mpg of just those cars which get exactly 25 miles per gallon.

Make sure you type two equals signs. Stata, like most computer languages, understands two different meanings for "equals." One equals sign is for assignment: mpg=25 means "Make mpg 25." Two equals signs is a question: mpg==25 asks "Is mpg 25?" This will drive you crazy for about a week and then become second nature. The following are the "relational operators" used to form conditions:

== Equal
> Greater than
< Less than
>= Greater than or equal to
<= Less than or equal to
!= Not equal

The exclamation point also means "not" more generally. For example, try:

list make mpg if !(mpg>25)

"Not" can be thought of as reversing the condition that follows it: changing false to true and true to false.

Combining Conditions

You can combine two conditions with "logical and" (&) or "logical or" (|, called the "pipe" and created by pressing Shift-\). With "logical and" the result is true if and only if both conditions are true, while with "logical or" the result is true if either condition is true, or both. In terms of set theory, "logical and" is the intersection, while "logical or" is the union.

Thus:

list make price mpg if mpg>25 & price<5000

will give you a list of cars that both get good gas mileage and are relatively cheap (remember these are 1978 dollars), while:

list make price mpg if mpg>25 | price<5000

will give you a list of cars that either get good gas mileage or are relatively cheap or both, a much larger list.

Once you get past two conditions, the order in which they're evaluated can change the result. There are set precedence rules, but we suggest using parentheses liberally to ensure Stata will evaluate conditions in the order you think it should.
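For instance, Stata evaluates & before |, so the two commands in the sketch below select different sets of cars. The example uses count (which reports how many observations satisfy a condition) instead of list to keep the output short, and sysuse to load the copy of auto.dta that ships with Stata; both are choices made for this illustration.

```stata
sysuse auto, clear
* Without parentheses & is evaluated first, so this means
* foreign==1 | (mpg>25 & price<5000)
count if foreign==1 | mpg>25 & price<5000
* Parentheses force the "or" to be evaluated first
count if (foreign==1 | mpg>25) & price<5000
```

The first count includes every foreign car regardless of price, while the second only includes cars costing less than $5,000, so the second count is smaller.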

Indicator Variables

While we've talked about conditions being true or false, in reality Stata uses numbers: one is true and zero is false. You can take advantage of this to write very natural if conditions with indicator variables:

list make if foreign

Part of what makes this work is the variable name: the variable foreign tells us whether or not a car is in fact foreign. Now consider a variable called gender: does a one indicate that a person is male or that a person is female? You can't tell without checking your codebook (or value labels). But if the variable were called female it would be obvious that a one means this person is female.

Be careful however: more generally, any number other than zero is considered true—even missing. If foreign were missing for some cars, list make if foreign would treat those cars as if they were known to be foreign. This leads some people to suggest that it's better to always use code like:

list make if foreign==1

But this treats cars with a missing value for foreign as if they were known not to be foreign. Which one is appropriate—or whether you need to do something else entirely—depends on what exactly you're trying to do.
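Since foreign is never missing in the auto data set, the difference is easier to see with a small made-up example. The sketch below types in three hypothetical observations (the variables id and female are invented for illustration) and tries both forms; it also uses the missing() function, which is discussed in the next section.

```stata
* Hypothetical data: the indicator is missing for the third person
clear
input id female
1 1
2 0
3 .
end
* Missing counts as true, so persons 1 and 3 are listed
list id if female
* Missing is excluded, so only person 1 is listed
list id if female==1
* Listing the unknowns separately shows who is affected
list id if missing(female)
```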

Missing Values and Inequalities

Internally, Stata stores the missing values ., .a, .b ... .z as the 27 largest possible numbers of each variable type, and in that order. It's very important to keep this in mind when dealing with inequalities: think of missing values as essentially "positive infinity."

Consider making a list of cars with "good" repair records, defined as rep78 greater than three:

l make rep78 if rep78>3

Cars with a missing value for rep78 are included, because infinity is much greater than three.

Whether that's a problem or not depends on your goal in making this list. If you want a list of cars which are not known to have poor repair records, that code is entirely correct. But if you want a list of cars which are known to have good repair records then you need to add a second condition:

l make rep78 if rep78>3 & rep78!=.

Now cars with missing repair records are specifically excluded. However, note that this would not exclude the other missing values: .a, .b, etc. Thus veteran Stata programmers will use:

l make rep78 if rep78>3 & rep78<.

The generic . is the smallest of the missing values, so rep78<. will be true for all valid values and false for all missing values.

An alternative that's longer to type but easier to read is:

l make rep78 if rep78>3 & !missing(rep78)

The missing function takes a variable name as input, and returns true if that variable is missing and false if it is not. (Alternatively you can give it a list of variables separated by commas and it will return true if any of them are missing.) The exclamation point reverses the result, making it effectively "not missing."

Since the "missing is infinity" rule is not intuitive, it's easy to forget. But doing so can be disastrous. Consider trying to identify senior citizens with the condition if age>=65: anyone whose age is missing would be called a senior citizen. Different rules for handling missing values have been proposed, but they just make the disasters occur under different circumstances. There's no alternative to checking for missing values and handling them properly where they exist.
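The senior citizen example can be sketched with a few made-up observations. The variable age and its values are hypothetical, and count (which reports how many observations satisfy a condition) is used to keep the output short.

```stata
* Hypothetical data: age is missing (.) for one person
* and coded .a for another
clear
input age
30
70
.
.a
end
* Counts 3: the 70-year-old plus both missing values
count if age>=65
* Counts 2: excludes . but not .a
count if age>=65 & age!=.
* Counts 1: excludes every kind of missing value
count if age>=65 & age<.
```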

Fortunately the checking is easy to do. Just type:

misstable sum

This is an example of a command with a subcommand. The misstable command can do many things, so the second word (or in this case abbreviation since sum is short for summarize) tells Stata you want it to give you a summary of the variables that have missing values. Since rep78 is the only variable listed, you now know that you have to worry about missing values when working with rep78, but can ignore them with all the other variables—as long as the data set doesn't change.
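A few variations on this check, offered as a sketch. It assumes the copy of auto.dta that ships with Stata, loaded with sysuse, matches the one in the example files.

```stata
sysuse auto, clear
* The same command with the subcommand spelled out
misstable summarize
* Add a varlist to check only particular variables
misstable summarize rep78 price
* List the cars whose repair record is missing
list make if missing(rep78)
```

The last command lists five cars.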

If you will be working with many data sets or data sets that do change, you can build a test for missing values right into your do files. The assert command checks to see whether a given condition is true or not for all observations. For example:

assert mpg<.

checks to see if mpg is always non-missing. Since it is, nothing happens. However, try:

assert rep78<.

Now you get an error message (along with information about how often the condition is violated). If you had been running a do file it would have come to a screeching halt. This is good: if your code was written on the assumption that rep78 is never missing, it's far better for that code to crash than to continue running and give you wrong answers.

assert is useful for far more than checking for missing values. With clever programming you can use it to check all sorts of assumptions about your data ("Each value of ID is associated with just one person", "Every household has a head of household", etc.). Doing so can save you a lot of headaches.
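A few checks of this kind, sketched against the auto data set. The inlist() function and the isid command are not otherwise covered in this article: inlist() is true if its first argument matches any of the values that follow it, and isid gives an error unless the listed variables uniquely identify the observations.

```stata
sysuse auto, clear
* An indicator variable should only ever be 0 or 1
assert inlist(foreign, 0, 1)
* Prices should be positive and never missing
assert price>0 & price<.
* Each car should appear only once
isid make
```

All three pass silently with this data set. If any of them failed, a do file containing it would stop at that point.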

Exercises

  1. A shipping company would like to ship cars from other countries to the US. Its ships can handle cargoes up to 200 inches in length and weighing up to 4,000 pounds. Doing so is only profitable for cars costing at least $5,000. Which cars should it consider shipping? (Solution)
  2. Suppose I want to buy a car. I'm willing to pay up to $4,000 for most cars, but I'll go up to $5,000 if the car is known to be reliable (rep78>3) and gets good gas mileage (mpg>25). Which cars should I look at? (Solution)
  3. Recall that we earlier typed list make mpg if !(mpg>25). Now try list make mpg if !mpg>25. What's going on? (Solution)

Options

Options control how a command runs. They go at the end of the command after a comma. There can be any number of options (all following a single comma) and they can go in any order. If an option needs more information, like a variable to act on, that information goes in parentheses immediately following the option. To see an option in action, type:

list make foreign

and then

list make foreign, nolabel

The nolabel option can be used with many commands. It tells the command to ignore value labels and show the actual values of the variable. Other options may only be used by a single command.

The list command has a header option that prompts it to repeat the header row containing variable names. This can make it easier to keep track of what the columns mean in long lists. However, it needs to know how many observations should go between each repetition of the header. As additional information, this number goes in parentheses after the word header:

list make-mpg, header(20)
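The other rules about options can be sketched the same way. The noobs and sepby() options of list are not discussed elsewhere in this article and are used here only as illustrations: noobs leaves out the observation numbers, and sepby() draws a separator line whenever the variables in its parentheses change value.

```stata
sysuse auto, clear
* Two options after a single comma; their order does not matter
list make foreign, nolabel noobs
list make foreign, noobs nolabel
* An option that takes a variable to act on, given in parentheses
list make foreign, sepby(foreign)
```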

By

by is used to run a command separately across groups. For example, list the domestic cars and the foreign cars separately by typing:

by foreign: l make foreign

Note how the list is broken into two parts. The first one says foreign=Domestic at the top, the second says foreign=Foreign. by splits the data set up into groups, one group for each unique value of the by variable, then executes the command for each group independent of the others.

Since by takes a varlist, you can work by more than one variable at a time. Try:

by foreign rep78: l make

You'll get the error message:

not sorted

Stata can only use by if the data set is sorted in the order of the by variables. This data set started out sorted by foreign, but not by rep78. Type:

sort foreign rep78

and Stata will sort the data and allow you to execute:

by foreign rep78: l make

successfully. As you can see it breaks the data set into one group for each unique combination of foreign and rep78 and then carries out the command. Note that some combinations (foreign cars with a rep78 of one, for example) do not occur in the data and are not listed.

Users got tired of forgetting to sort before using by, so Stata added bysort:

bysort foreign rep78: l make

This will first sort the data by foreign and rep78, then carry out the rest of the command.

A caution about sorting: Stata's default sort algorithm is not stable, meaning that observations with the same values of the sort variables may end up in a different order relative to each other than they started in. For example, if you have data consisting of individuals grouped into households, running sort household may change the order of individuals within a household. If the order is important, add the stable option to the sort command (e.g. sort household, stable) and Stata will switch to a slower algorithm that is stable.
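With the auto data set, the two usual ways of getting a predictable order look like the sketch below. The second approach, adding variables to the sort until no two observations are tied, is a common practice and not something this article prescribes.

```stata
sysuse auto, clear
* Cars tied on foreign keep their existing order relative to each other
sort foreign, stable
* Alternatively, sort on enough variables that there are no ties,
* so the resulting order is fully determined
sort foreign make
```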

Exercise

  1. make appears to be a unique identifier in this data set (i.e. each car has a unique value of make). If so, what would it look like if you did a list "by make"? Do the list; is it what you expected? What would it look like if make were not in fact a unique identifier? (Solution)

We'll do much more with by in Working with Groups.

Last Revised: 1/4/2016