SSCC Publications

An Introduction to Stata

Last Revised: 3/20/2008

Stata is the most popular program for statistical analysis at the SSCC, as it is both extremely powerful and relatively easy to learn. Its straightforward but flexible syntax makes it a good choice for data manipulation and management, and it implements a very large number of statistical models and techniques. Stata also has an extensive user community which has made a great deal of code available for free, including many additional estimators. We've been quite pleased with Stata at the SSCC, and we think you'll find it extremely useful.

There are two different approaches one can take to Stata. One is to use it as an interactive tool: you start Stata, load your data, and start typing or clicking on commands. This is an excellent way to learn Stata; thus it's how you'll spend most of your time as you work through this publication. It is also a good way to explore your data, figure out what you want to do, and check that your programs worked properly. However, interactive work cannot be easily or reliably replicated, or modified if you change your mind. It's also very difficult to recover from mistakes--there's no "undo" command in Stata.

The other approach is to treat Stata as a programming language. In this approach you write your programs, called do files, and then run them. A do file contains exactly the same Stata commands you'd type in interactive Stata, but since they're all written up in a permanent file they can easily be rerun, modified, checked for errors, or debugged. They also serve as an exact record of how you obtained your results--a sort of lab notebook for the social scientist. I feel very strongly that any work you intend to publish or present should be done using do files. Thus this publication will for the most part ignore Stata's graphical user interface and focus on preparing you to write do files for research.

The goal of this publication is to give you a solid foundation in Stata that you can then build on to become an expert Stata user. If your goal is to learn just enough Stata to get you through a particular course you might want to consider reading something like Alan Acock's book "A Gentle Introduction to Stata" instead.

This publication contains the following sections:

  1. Running Stata
  2. Getting Started
  3. Stata Commands
  4. Working with Data
  5. Commands to Examine Data
  6. Renaming and Labeling Variables
  7. Creating and Modifying Variables
  8. Analysis
  9. Graphs
  10. Do Files
  11. Organizing Your Research Project
  12. Learning More

Running Stata

The SSCC makes Stata available on Winstat and our Linux servers. For details about the capabilities of the SSCC's servers see Computing Resources at the SSCC. You can find out how busy the various Linux servers are by visiting our server status web page (Winstat always directs you to the least busy server). Windows Stata and Linux Stata look and act the same, and you can even write your programs in Windows and run them in Linux; see Running Linux Programs Using Windows (Mostly) for details. Linux Stata is significantly faster, however, partly because of the nature of Linux but mostly because the SSCC's Linux servers run Stata/MP, which uses multiple processors. You can also submit Stata jobs to the SSCC's Condor flock either from Linux or from the web.

To start Stata on Winstat, click on Start, Programs, Stata 10, and then StataSE 10. To start Stata on a Linux server, type xstata. This requires X-Windows graphics to run. If you're connecting to Linux from a PC or from Winstat you will need to use X-Win32 to display Linux graphics: see Connecting to SSCC Linux Computers using X-Win32.

The Stata Interface

You'll see something like this:

[Figure: Stata's graphical user interface]

The window on the bottom right with no label is where you'll enter commands. When you press Enter, they are pasted into the Results window above. This is where you will see your commands execute and view the results. On the left are two convenience windows. Variables keeps a list of your current variables. If you click on one of them, its name will be pasted into the current command at the location of the cursor, which saves a little typing. The Review window keeps a list of all the commands you've typed this Stata session. Click on one, and it will be pasted into the command window, which is handy for fixing typos. Double-click, and the command will be pasted and re-executed. You can also recall previous commands by pressing Page Up. You can export everything in the Review window into a do file by right-clicking on it, but this includes any mistakes you made.

Getting Started

Start up Stata on the server you've chosen. You should be seeing the graphical user interface just like the picture above.

Memory

Stata loads your entire data set into memory, but by default it sets aside just ten megabytes to store it. This is enough for many data sets (including the trivial one we'll deal with in our examples) but for real work you'll often need to set it much higher. This is done by typing:

set mem size

The default unit for memory sizes is kilobytes, but you'll probably want to use megabytes. Just add "m" to the number. For example to claim 100 megabytes type:

set mem 100m

If you don't know how big your data set is you can find out using Stata's ls command, which we'll discuss shortly. Set Stata's memory to about 25% to 50% larger than the data set you'll be using, depending on what you plan to do with it (in particular, whether you'll be adding new variables).

If you try to set the memory too high, you'll get the message:

op. sys. refuses to provide memory
r(909);

If you get this and really need that much memory, the first thing you should do is switch to Linux because it can provide more memory. If the standard Linux servers fail, go to Falcon. It runs 64-bit Linux and can provide even more memory. If Falcon can't provide enough memory you need to rethink your strategy. Are there variables or observations in your data set that you aren't using? Could you split the data set into sections and process each section separately? This may be a good time to visit the consultant for advice.

Note that when you resize the memory any data currently in memory would be lost. Get in the habit of setting memory first, before loading anything.

Finding and Loading Data

Now you're ready to load some data. Stata can access the entire file system of the computer it is loaded on, and uses Linux-style directory navigation to move around. However, this means that, like Linux, Stata has a hard time with file and directory names that have spaces in them. If the file or directory you need has a space in it, you must put the entire path in quotes. On the other hand, Stata doesn't care if you use forward slashes (/) or backslashes (\) to separate directories.

Stata Corp. thoughtfully includes some sample data with the Stata program and we'll use it extensively. Let's start by using the cd (change directory) command to navigate to the directory where Stata is installed. On Linux that's /software/stata. Type:

cd /software/stata

On the Winstats, you need c:\program files\stata10, and program files has a space in it. So type:

cd "c:\program files\stata10"

If you are not using one of the SSCC's servers Stata may be installed in the same location, or you may need to look around (c:\stata is another popular choice). You can also get the example data directly from Stata's web site, as you'll see in a moment.

Next see what's here using the ls (list) command. Just type:

ls

The file you want is called auto.dta (.dta is the standard extension for Stata data sets). Its size is listed as 5.8k, meaning 5.8 kilobytes, so the default memory size of 10 megabytes is very much more than adequate. To load it type:

use auto

Note that you didn't have to type the .dta; Stata assumed it. There's just one trick to the use command: if you already have data in memory, and if you've made any changes at all since you loaded it, Stata will refuse to replace it with another data set unless you specifically tell it to do so. You can do this in two ways. One is to type clear before typing use, thus removing all the current data from memory. The other is to add the clear option to the use command (more on options in a moment). To do that, type:

use auto, clear

With the clear option, the new data will replace the old with no complaints.

Yes, you can also load data by clicking on File, Open, etc. But when you start writing do files you'll need to use the use command, so you might as well start now.

Stata can open a data set from the web as easily as from your local hard drive. For example, you can get this exact same data set by typing:

use http://www.stata-press.com/data/r10/auto.dta, clear
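
As a small sketch of the two ways to discard the data in memory before loading, the lines below use sysuse, a command not covered in this publication that loads the example data sets installed with Stata from any directory. It is used here only so the example does not depend on where Stata is installed.

```stata
* Option 1: empty memory first, then load
clear
sysuse auto

* Option 2: load and discard the old data in one step
sysuse auto, clear

* Confirm what was loaded
describe, short
```

Either form should leave you with the same 74-observation data set in memory.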

Stata Commands

Now that you've given a Stata command let's talk about how they work. The general form of a Stata command is this:

[by varlist:] command [varlist] [=expression] [if expression] [in range] [statistical weights] [, options]

Brackets mean that element may or may not be there in a given command. Some commands require some elements or cannot use others. We'll go through most of these elements using the list command as an example. Type:

list

The list command, unsurprisingly, lists your data. You'll get quite a bit to look at even with the small example data set--don't try this with census data! There are ways to list just what you want, but for now quit the current command by pressing q or clicking on the red, stop-sign shaped button with the white X on it near the top of the screen.

list can be abbreviated as just l. As you'll see, many Stata commands have abbreviations.

Varlists

If you give a command a varlist the command will be executed just for the variables in the varlist. Type:

l make

make is one of the variables in this data set. When you type l make it lists just the make of each car.

As the name suggests a varlist can include multiple variables. Try typing:

l make price mpg

If

An if condition specifies which observations the command should act on:

l make mpg if mpg==25

This gives you a list of just those cars which got exactly 25 miles per gallon. Note that you had to type two equals signs. Stata, like most computer languages, understands two different meanings for "equals." One equals sign means assignment: mpg=25 means "make mpg 25." Two equals signs is for testing: mpg==25 asks "is mpg equal to 25 or not?" This will drive you crazy for about a week and then it will become second nature.

Also note the order: l if mpg==25 make won't work.

The exclamation point is used for "not." != means "not equals" but you can also use it by itself. For example, try:

l make mpg if mpg!=25
l make mpg if mpg>25
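l make mpg if !(mpg>25)

The exclamation point can also be thought of as reversing the following condition: changing false to true and true to false. The parentheses in the last command matter: ! is applied before comparisons like >, so !mpg>25 would be read as (!mpg)>25, which is never true in this data set.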

if conditions can be very complicated and often making a program work will come down to crafting the exact condition that will identify the observations you need. Logical and is denoted by & and logical or is denoted by | (the pipe character, which you get by pressing Shift-\). Use parentheses liberally to avoid getting confused about the precedence of logical operations. Try:

l make if (price<4000) | (price<5000 & mpg>30)

This gives you a list of cars someone might buy if they wanted to pay less than $4000 but were willing to go up to $5000 if the car got more than 30 miles per gallon. (Note that these prices are in 1978 dollars!)
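
One way to check how a complicated condition is being read is to count the observations that satisfy it. The sketch below assumes a fresh copy of the example data loaded with sysuse, and uses the count command, which this publication has not introduced; it simply reports how many observations meet the condition.

```stata
sysuse auto, clear

* How many cars meet each part of the condition, and the whole thing?
count if price<4000
count if price<5000 & mpg>30
count if (price<4000) | (price<5000 & mpg>30)

* & is evaluated before |, so dropping the parentheses gives the same
* answer here, but the parentheses make the intent obvious
count if price<4000 | price<5000 & mpg>30

* ! is evaluated before >, so these two are NOT the same
count if !(mpg>25)
count if !mpg>25
```

The last count should be zero: !mpg is 0 for every car, and 0 is never greater than 25.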

In

in allows you to specify the observations the command should act on by observation number. For example, to see the make of the first three observations, type:

l make in 1/3

1/3 is Stata's shortcut for the integers one, two and three, so what you see are observations one, two and three. If you give negative numbers, Stata will count from the end of your data set. So to see the makes of the last three observations, type:

l make in -3/-1

Note the order: the numbers still go from smallest to largest. This is because in this data set -3 really means observation 72 (third from last) and -1 is really 74 (the last). in is handy if you just want to see a few random observations to check results, but it's especially useful if the order of the observations means something (for example, if the head of a household is always the first observation in the household).

An Aside on Value Labels

To learn by: we'll focus on the foreign variable, but there's something important you need to notice about it first: foreign has value labels assigned to it. If you just type:

l foreign

it appears that foreign is a string of characters just like make. This is deceptive. To see what's really going on add the nolabel option. Options affect how commands are executed. Some are unique to a certain command, but others apply to many commands. As described by the generic command syntax, options always come at the end of the command, following a comma:

l foreign, nolabel

The variable foreign is actually stored as an indicator variable (0 or 1) but a value label has been defined so that 0 is displayed as Domestic and 1 is displayed as Foreign.

Value labels are very convenient, but it's important to notice them. For example,

l make if foreign=="Domestic"

won't work. The syntax is correct (including putting character strings like "Domestic" in quotes) but you'll get a type mismatch because foreign is actually a number and you can't compare a number to a character string. The correct command is:

l make if foreign==0
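
If you aren't sure which numbers lie behind the labels, you can ask Stata directly. This sketch assumes the copy of auto.dta installed with Stata (loaded with sysuse, which is not covered in this publication), where the value label attached to foreign is named origin. The describe and label list commands are discussed later.

```stata
sysuse auto, clear

* The "value label" column shows which label, if any, is attached
describe foreign

* Show the mapping from numbers to text for that label
label list origin

* Conditions must use the underlying numbers
list make if foreign==0 in 1/10
```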

By:

By: is used to run a command separately for different groups. For example, list the domestic cars and the foreign cars separately by typing:

by foreign: l make foreign

Note how the list is broken into two parts. The first one says foreign=Domestic at the top, the second says foreign=Foreign. By: splits the data set up into separate groups, one group for each unique value of the by: variable, then executes the command for each group.

Since by: takes a varlist, you can use more than one variable at a time. Try both foreign and rep78, a measure of the car's repair record on a five-point scale:

by foreign rep78: l make

You'll get the message:

not sorted
r(5);

Stata can only use by: if the data set is sorted by the varlist. This data set started out sorted by foreign, but not by rep78. Annoying, but hardly fatal. Type:

sort foreign rep78

and Stata will sort the data and allow you to execute by foreign rep78: l make successfully. As you can see it breaks the data set into one group for each unique combination of foreign and rep78 and then carries out the command.

Users got rather tired of that error message, so Stata provided a shortcut:

bysort foreign rep78: l make

This will first sort the data by foreign and rep78, then carry out the rest of the command.

Working with Data

Now we'll move from Stata commands in and of themselves to how Stata thinks about your data--and how you can take advantage of that in order to make your programs work.

Missing Values

You may not have expected two of the groups in the last list: foreign = Domestic, rep78 = . and foreign = Foreign, rep78 = . . That's because rep78 has a possible value in addition to the numbers one through five: it can be missing, which Stata denotes by a period (.). A Stata data set is just a big matrix. Each observation is a row and each column is a variable. The matrix is always rectangular and can't have "holes" in it, so if an observation doesn't have a value for a variable Stata stores missing (.). Internally, missing is stored as +infinity, which cuts down the special code needed to deal with it but often causes confusion.

When using by: missing is just another value and gets its own by: group as you saw. If you are doing math, any expression that includes a missing value gives a missing value for a result. So if you defined a new variable (we'll learn how in a bit) to be rep78+mpg the new variable would be missing for any observation in which rep78 or mpg is missing.

To see how missing values can confuse people, type in:

l make rep78 if rep78>3

What you might expect is that rep78 would be all 4's and 5's, because rep78 can only be an integer between 1 and 5 inclusive. But missing is coded as +infinity, which is definitely greater than 3. Thus observations with rep78 missing are included in the list. This can cause major headaches if not handled properly. For example, imagine trying to identify people over 65 with the condition if age>65. You'd end up declaring everyone whose age is missing to be "over 65." There have been many proposals to change the way Stata handles missing values in if conditions, but none of them have really been any better. The bottom line is that you have to think about missing values and how to handle them.

The easiest way is to make sure there are no missing values. Stata has a handy command called assert which can do just that. You give it a condition and it will check whether the condition is true for all observations or not. Type:

assert mpg!=.

This asserts that mpg is never missing. When you type it, nothing happens. That means the assertion is true. From now on you never have to worry about the possibility that mpg is missing. Now try:

assert rep78!=.

This time Stata complains at you. If you had put this in a do file your do file would have crashed--which is good. It's better for a do file to crash than for it to do something that doesn't make sense because you erroneously thought there were no missing values in your data.

What you do with the missing values you have depends on what you're trying to do with your data. Let's assume that in this case you wanted a list of cars that were known to have rep78 greater than 3. Thus missing values should be excluded. Then type:

l make rep78 if rep78>3 & rep78!=.

Now you'll get the list of 4's and 5's you wanted before. This has a weakness though: Stata also allows you to track different kinds of missing as .a, .b, up through .z. For example, a survey might have "did not apply", "refused to answer" and "the reason why this one is missing is itself missing" and you could code those three situations as .a, .b and .c. But since .a is not the same as the generic ., the condition rep78!=. will not exclude the .a's, .b's, etc. Veteran Stata programmers would thus write:

l make rep78 if rep78>3 & rep78<.

.a is not the same as ., but both are treated as larger than any actual number, and . is the smallest of the missing values. Only actual (non-missing) values will meet the condition rep78<.. This way you don't have to worry about the possibility that a .a might wreck your code.
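
The difference between the two conditions can be seen in a small experiment. This sketch assumes a fresh copy of the example data (loaded with sysuse) and deliberately overwrites one value of rep78 with .a purely for illustration; the shipped data contains only the generic missing value. It also uses count and the missing() function, neither of which has been introduced so far.

```stata
sysuse auto, clear

* Pretend the first car's repair record is missing for a special reason
replace rep78 = .a in 1

* Excludes . but still counts the .a observation
count if rep78>3 & rep78!=.

* Excludes every kind of missing value
count if rep78>3 & rep78<.

* missing() is true for . and for .a through .z
count if missing(rep78)
```

The first count should come out exactly one higher than the second, because of the single .a value.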

Explicit Indexes

Stata has several powerful tools for accessing specific observations. First of all, most Stata commands are actually loops. The list command, for example, lists the first observation, then the second observation, then the third, etc. As Stata is performing this loop, it keeps track of which observation it's working on in a variable called _n. You're also welcome to use it in your code. For example, type:

l if _n==5

This lists the fifth observation. Which is the least fuel efficient car? Type the following:

sort mpg
l make mpg if _n==1

No particular surprise there. (There were no Hummers in 1978.) How about the most efficient? Now you want the last observation. Stata has another internal variable called _N. It contains either, depending on how you want to think about it, the number of observations in the data set or the observation number of the last observation--it's the same number either way. (Note to veteran programmers: Stata's observation numbers start with one, not zero as you may be used to for arrays.) Thus:

l make mpg if _n==_N

gives you the most efficient car (which is, again, no particular surprise--unless you were expecting today's hybrids to get better mileage than the top car in 1978). You could also sort in descending order using gsort (think generalized sort). gsort works just like sort, except if you put a minus sign in front of a variable it will be sorted in descending order. Thus you could have typed:

gsort -mpg
l make mpg if _n==1

and gotten the same result.

Now, so far you could have done all this with in (try it!). But explicit indexes can do much more. For one thing, _n and _N take into account by: groups, while by: and in can't be combined. Suppose you wanted to know the most fuel efficient domestic car and the most fuel efficient foreign car. All you have to do is type:

sort foreign mpg
by foreign: l make mpg if _n==_N

In order to use by foreign: the data had to be sorted by foreign, and to get the result you wanted each type of car had to be sorted by mpg. When you started your command with by foreign:, Stata split the data set in two and each group had its own value of _N. _n also starts over from one when the command goes from one by: group to the next. Your sort guaranteed that in both groups the most fuel efficient car was last, or in other words that it had _n==_N.
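
The sort and the by: can also be combined into one line. In the sketch below, which assumes a fresh copy of the example data loaded with sysuse, variables in parentheses after bysort are used for sorting only and do not define groups. The gen and egen commands used for the check are covered later in this publication, and the names isTop and maxMPG are made up for the example.

```stata
sysuse auto, clear

* Group by foreign, sort by mpg within each group, list the last car
bysort foreign (mpg): list make mpg if _n==_N

* Check: the last car in each group has that group's highest mpg
bysort foreign (mpg): gen isTop = (_n==_N)
by foreign: egen maxMPG = max(mpg)
assert mpg==maxMPG if isTop
```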

A challenge for you: in this data set we hope that the make variable uniquely identifies a car, and most data sets have such a putatively unique identifier. The tools we've just learned allow us to make sure this is true with just one command--see if you can think of it.

The answer combines assert, by: and _N (think some more and see if you can get it now). The solution may be tricky to understand but it's easy to use:

bysort make: assert _N==1

bysort make: splits the data set into a separate group for each value of make. The assert then checks that the size of each of these groups is 1. If there were two cars with the same value of make they'd be in the same by: group and that group would have a size of two (and thus _N==2). If every by: group has just one observation in it (_N==1) then make is a unique identifier.

(You could also check this by typing duplicates report make and making sure there are no surplus observations. The duplicates command can do a lot of other things too, but the goal here is to learn how to do useful things with by: and _N.)
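
Stata also provides the isid command for exactly this check. The sketch below assumes a fresh copy of the example data loaded with sysuse, and uses capture and _rc, which are not covered in this publication: capture runs a command without stopping on error, and _rc holds the resulting return code (zero means success).

```stata
sysuse auto, clear

* Both of these are silent if make uniquely identifies the observations
bysort make: assert _N==1
isid make

* foreign is clearly not an identifier, so the check fails;
* capture traps the error so we can look at the return code
capture isid foreign
display _rc
```

In a do file you would normally leave out the capture, so that a failed check stops the program.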

That's not all you can do with explicit indexes. Type:

di make[1]

di is short for display, and simply displays something on the screen (it's also one of the few commands that doesn't get executed once for each observation). make[1] means the value of make for observation 1. Try displaying make[5], make[10], as many as you like.

You may have noticed that the mpg data is all integers. That means that while in theory mpg is a continuous variable, in reality many cars have the same value of mpg. Let's make a list of all the cars that have the same value of mpg as some other car:

sort mpg
l make mpg if mpg==mpg[_n-1] | mpg==mpg[_n+1]

How did this work? By sorting the data by mpg, we put all the cars with the same value of mpg next to each other. Thus to check if a given car shares a value of mpg we only have to look at the car before it and the car after it and see if their values of mpg are the same.

Since _n is the observation Stata is currently working on, _n-1 means the observation before the one Stata is currently working on and _n+1 means the observation after. However, note that for the first observation mpg[_n-1] is missing, and for the last observation mpg[_n+1] is missing. If we had a car with a missing value for mpg it would have been sorted to the end (remember missing is +infinity) and then shown up in the list even if no other car was missing mpg, because for it mpg==mpg[_n+1] would be true. We could add some code to handle this situation, but remember we checked and know that mpg is never missing so we don't have to worry about it.

While this example is fairly silly, you'll find plenty of uses for this ability to access observations other than the one being worked on.
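
The same question can be answered with by: groups instead of neighboring observations, which avoids having to think about the first and last observation. This is only a sketch of an alternative, assuming a fresh copy of the example data loaded with sysuse; gen is covered later in this publication, and the variable name sharesMPG is made up for the example.

```stata
sysuse auto, clear

* A car shares its mpg with another car if its mpg group has more than one member
bysort mpg: gen sharesMPG = (_N>1)
list make mpg if sharesMPG

* Check that this agrees with the neighbor comparison
sort mpg
assert sharesMPG == (mpg==mpg[_n-1] | mpg==mpg[_n+1])
```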

Commands to Examine Data

You now have a solid understanding of Stata syntax. However, to do anything useful you'll need to know more commands than just list. We'll start with more commands for examining data.

describe (d) is a good place to start whenever you open a new or unfamiliar data set. It will give you information like the number of observations and variables and the size of the data set in memory, plus a list of the variables it has and their types, along with any labels describing them. Especially watch out for value labels.

summarize (sum) gives you summary statistics. If you just type:

sum

you will get basic summary statistics for all the variables in your data set (no varlist usually means the command should act on all the variables). Note that there is nothing for make: it is a string variable so means and such don't make sense. The detail (d) option will give more information. Try:

sum mpg, d

tabulate (tab) will create tables of frequencies. It requires a varlist of either one or two variables. Try:

tab rep78

tab foreign rep78

tab rep78 foreign

to get an idea of what tab does. Tables are usually easier to read if the variable with the most unique values comes first, so they're listed vertically. There are limits though. Try:

tab weight foreign

Because weight is a quasi-continuous variable it has too many unique values for this table to be useful. (tab foreign weight would be even uglier.)

The tab command won't accept more than two variables, but you can create three-way or higher tables by combining tab with by:.

bysort foreign: tab mpg rep78

If you want to create one-way tables for multiple variables, use tab1:

tab1 mpg foreign

Since tab1 only does one-way tables it won't interpret this as a request for a two-way table like tab would. In fact tab1 will take any number of variables and create one-way tables for them all.

tab has an option called sum which is similar to the sum command. Try

tab foreign, sum(mpg)

This gives summary statistics of mpg for the foreign and domestic cars in addition to their frequencies.

There's also a chi2 option that runs a chi-squared test on a two-way table:

tab rep78 foreign, chi2

Renaming and Labeling Variables

The next commands we'll learn actually make changes to the data set. We'll continue working interactively as before because it's the best way to learn, but I want to emphasize that for actual research you should only change your data set using do files. Remember, there's no "undo" in Stata.

It amazes and dismays me to see some of the gibberish that researchers often use for variable names (H2V06 and the like). Survey makers with 10,000 variables may not have a choice, but once you're down to the dozen variables you'll actually use you're not stuck using the names they gave you. Renaming variables to something that is meaningful takes a bit of time, but will save a great deal of time and confusion down the road. Variable names can be up to 32 characters long.

The names in the auto data set are generally good, but rep78 doesn't mean much on its own so let's change it to something more clear. rename (ren) will do the job:

ren rep78 repairRecord

Note that variable names must be one word with no spaces. However capitalization can make them more readable. For comparison try to interpret numinhh vs. numInHH (number in household). Another alternative is to use the underscore (_) as a space: num_in_hh. Personally I get tired of reaching for the underscore key, but it's a matter of taste.

The proper length for variable names is also a matter of taste. repairRecord is very clear, but it's fairly long. rep78 is much shorter, but doesn't really tell you what it means if you don't already know. In general the more often you use a variable the shorter its name should be, both to save typing and because it will be familiar anyway.

Labels

There is a partial substitute for long variable names: Stata allows you to define labels for variables and values that appear in output. You only have to type them once, so they can be as long as you want (though if they're too long they'll make your output ugly).

This data set already has a good set of labels. Type d to see them. Then change the label of the data set itself by typing:

label data "1978 Automobile Data that came with Stata"

The label on foreign is a bit misleading, so change it too:

label variable foreign "Car Origin"

Type d again to see the results.

Next let's explore value labels by labeling the values of the repairRecord variable (formerly known as rep78). A value label is a mapping from a set of numbers to a set of descriptions. First you must create the map. Type the following:

label define repRec 1 "Very Bad" 2 "Bad" 3 "Average" 4 "Good" 5 "Very Good"

Then you need to tell Stata to label the values of the repairRecord variable using the repRec mapping you just created:

label values repairRecord repRec

To see the results, type:

tab repairRecord

Two final commands for labels: label dir gives you a list of all the defined labels, and label list tells you what they mean.

Once a map is defined, there's no limit to the number of variables you can apply it to. Suppose you're working with survey data and your variables include the gender of the respondent, the gender of the respondent's spouse, and the genders of all the respondent's children. You could define just one map called gender and then use it to label the values of all the gender variables.
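
As an illustration of sharing one map between variables, the sketch below builds a tiny made-up data set. The variable names respGender and spouseGender and the data themselves are invented for the example, and set obs (which sets the number of observations in an empty data set) is not covered in this publication.

```stata
* Build a four-observation toy data set
clear
set obs 4
gen respGender = mod(_n, 2)
gen spouseGender = 1 - respGender

* Define the map once...
label define gender 0 "Male" 1 "Female"

* ...and apply it to as many variables as you like
label values respGender gender
label values spouseGender gender

tab respGender spouseGender
```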

Creating and Modifying Variables

generate (gen) creates new variables. The general syntax is simply:

gen newVariable=some expression

As an example, create a variable giving the car prices in 2008 dollars. To convert 1978 dollars to 2008 dollars you need to multiply the 1978 price by about 3.3:

gen price2008=price*3.3

Type:

sum price price2008, d

to see summary statistics of the results.

replace changes existing variables, but the syntax is identical to gen. Let's be a bit more precise: the conversion factor is actually closer to 3.31, so change price2008 accordingly:

replace price2008=price*3.31

Note that there is no abbreviation for replace. Commands that could destroy data never have abbreviations.

Both gen and replace can be used with if. When you make a new variable with gen it is created for every observation, but where the if condition is not true for a particular observation that observation gets a missing value for the new variable. With replace, where the if condition is not true the value of the variable is left unchanged.

Suppose you wanted to collapse the five-point scale of the repairRecord variable (formerly rep78) into a three-point scale. Here's one way to do it:

gen rep3=1 if repairRecord<3
replace rep3=2 if repairRecord==3
replace rep3=3 if repairRecord>3 & repairRecord<.

The first line creates the new variable, but only sets it to one for cases where repairRecord is less than three. The others get missing. The second line changes some of those missings to twos, and the third changes more of them to threes. Note how the third line specifically excludes observations where repairRecord is missing. What will the value of rep3 end up being for those cases? Missing, simply because it was never set to anything else.

(There is a recode command which can do this particular task more compactly, but the real goal here is to learn how to use gen and replace.)
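
For comparison, here is a sketch of how the same task might look with recode. It assumes a fresh copy of the example data loaded with sysuse, so the variable still has its original name rep78, and the new variable name rep3b is made up for the example. The final assert checks that both approaches give the same answer, including for the cars with a missing repair record.

```stata
sysuse auto, clear

* The gen and replace version
gen rep3=1 if rep78<3
replace rep3=2 if rep78==3
replace rep3=3 if rep78>3 & rep78<.

* The recode version: each rule maps old values to a new value,
* and values not mentioned (including missing) are left alone
recode rep78 (1 2=1) (3=2) (4 5=3), gen(rep3b)

assert rep3==rep3b
```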

One common recoding task is turning a categorical variable into a set of indicator variables, but tab has a gen option that can do that for you. Type:

tab repairRecord, gen(repair)

Now type d to see what you've done. Note that it even makes labels!

The egen command, short for extended generate, gives you access to a large library of functions--type help egen for a full list. With standard generate you have to (or get to, depending on your point of view) specify exactly what the new variable should be equal to. With egen you simply choose the function that meets your needs. If there isn't one, you're back to generate.

Suppose you wanted to find the mean value of mpg, calculated separately for the foreign cars and the domestic cars for practice with by:. egen has a mean function which will give you exactly what you want:

by foreign: egen meanMPG=mean(mpg)
l make foreign mpg meanMPG

But what if for some odd reason you wanted to create halfMeanMPG equal to the mean divided by two? What you can't do is add that to the egen command:

by foreign: egen halfMeanMPG=mean(mpg)/2

The error message you'll get is confusing, but the real problem is that all egen can do is set a variable equal to the result of a single function, not an expression (like mean/2). If you really wanted to divide the mean by two you'd have to type:

by foreign: egen halfMeanMPG=mean(mpg)
replace halfMeanMPG=halfMeanMPG/2
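
Here is one possible variation on the same idea, assuming the shipped auto data. It uses bysort, which sorts by foreign before running the command for each group, and leaves the arithmetic to an ordinary gen.

```stata
sysuse auto, clear
* bysort sorts by foreign first, then runs egen within each group
bysort foreign: egen meanMPG=mean(mpg)
* An ordinary expression belongs in gen, which can use the egen result
gen halfMeanMPG=meanMPG/2
* Within each group the new variables should not vary
by foreign: summarize mpg meanMPG halfMeanMPG
```

Keeping meanMPG around rather than overwriting it means you still have the group mean if you need it later.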

One trick that can be very handy is setting a variable equal to a condition. Stata has no boolean (true or false) variable type. Instead it uses numbers. Zero is always false. If you're testing to see if something is true, Stata will interpret anything but zero as true. But if you're setting a variable equal to a condition, Stata will set the variable to one if the condition is true and zero if it is false.

For example, let's create an indicator variable called gasGuzzler that is one for cars that get less than average gas mileage and zero for others:

gen gasGuzzler=(mpg<meanMPG)

If the condition is true, gasGuzzler will be one. If it is false, gasGuzzler will be zero. We can then list all the gas guzzlers by typing:

l if gasGuzzler

This is equivalent to

l if gasGuzzler==1

but more natural. You can do the same thing with gender variables: if you've got gender encoded as male=0 and female=1, consider calling the variable female rather than sex or gender. Then you can write commands that end in conditions like if female instead of writing if sex==1 and having to remember what that means.
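
One thing to watch for when setting a variable equal to a condition is missing values. The sketch below assumes the shipped auto data, where rep78 is missing for a few cars; goodRecordNaive and goodRecord are hypothetical names.

```stata
sysuse auto, clear
* Missing is treated as larger than any number, so this condition is
* true for the cars whose rep78 is missing
gen goodRecordNaive=(rep78>3)
* Adding if leaves the indicator missing where rep78 is missing
gen goodRecord=(rep78>3) if rep78<.
* Compare the two versions
tab goodRecordNaive goodRecord, missing
```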

Let's do the ultimate modification of a variable:

drop meanMPG

This eliminates the variable meanMPG from our data set. We can also eliminate observations, for example:

drop if gasGuzzler

gets rid of all gas guzzlers (just doing our bit to fight global warming).

keep does the same thing, but in the opposite fashion. keep meanMPG would get rid of all variables but meanMPG, while keep if gasGuzzler would get rid of all the fuel-efficient cars.
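
A quick sketch of drop and keep in action, assuming the shipped auto data. The cutoff of 20 miles per gallon is arbitrary and chosen only for illustration.

```stata
sysuse auto, clear
* Number of observations before dropping anything
count
* Drop the cars that get less than 20 miles per gallon
drop if mpg<20
count
* Keep three variables and drop all the others
keep make mpg foreign
d
```

Running count before and after is an easy way to see how many observations a drop actually removed.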

Saving Data Sets

Finally, if you were doing anything important you'd need to save your work. Just type save filename to create a new Stata data set containing the data that is currently in memory. If this file already exists Stata will refuse to overwrite it unless you use the replace option, so in do files this command usually looks like:

save filename, replace

Note that if you do not specify an extension, Stata will add .dta by default (which is what you want).

Never save over the data set you loaded. If you do, you can never rerun the do file (at least not in the same way) because the original input is gone.
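
In practice that might look something like the following sketch, which assumes the shipped auto data and uses a made-up file name, autoGuzzlers.

```stata
sysuse auto, clear
gen gasGuzzler=(mpg<20)
* Save under a new name so the original data set is never overwritten
save autoGuzzlers, replace
* Reload the saved copy to confirm it was written
use autoGuzzlers, clear
d gasGuzzler
```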

Analysis

Stata has many, many commands for doing all sorts of statistical analysis, but its developers have worked very hard to make them all as similar as possible. So while we're just going to do a simple linear regression, the syntax is almost identical for a broad range of more complex models.

Since we've made rather a mess of this data set, reload the original by typing:

use auto, clear

Now let's see how much consumers are willing to pay for good gas mileage using a simple, naive, hedonic pricing model. Whether a car is foreign or domestic seems to be important, so throw that in too. Type:

regress price mpg foreign

This regresses price on mpg and foreign. Note that regress takes a varlist, just like any other command, but it uses it in a particular way. The first variable is the dependent variable, and it is regressed on all the others in the list plus a constant (unless you add the noconstant option). The results suggest that American consumers dislike fuel efficiency, and will pay to avoid it!

Like any good researcher, when our empirical results contradict our theory we look for better empirical results. We just might have some omitted variable bias here; in particular it's probably important to control for the size of the car. Looking over the variables we see lots of variables related to size. You could throw them all in, but they're probably highly correlated and you don't want to introduce collinearity. Check using the correlate (corr) command. Type (note that this is a great time to use the Variables window to enter variable names by clicking on them):

corr weight length displacement trunk headroom

While all the variables are positively correlated, weight, trunk, and headroom aren't too bad so go ahead and add all three:

reg price mpg foreign weight trunk headroom

Now mpg is insignificant but weight is highly significant. Looks like Americans like big cars and don't care about fuel efficiency. That I'll believe.
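
If you want to pull individual numbers out of a regression rather than read them off the table, a sketch along these lines may help. It assumes the shipped auto data and uses the stored results _b[], e(r2) and e(N), which regress leaves behind.

```stata
sysuse auto, clear
* First model: price on mpg and foreign, plus a constant
regress price mpg foreign
* Stored results remain available until the next estimation command
display "Coefficient on mpg: " _b[mpg]
display "R-squared: " e(r2)
* Second model adds the three size controls
reg price mpg foreign weight trunk headroom
display "Observations used: " e(N)
```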

Graphs

Stata has a suite of tools for creating publication-quality graphs. Graphs are inherently complicated objects and the syntax for creating them can also get quite complicated. However, simple graphs with the default settings are very easy to make. For example, to make a scatterplot of mpg versus weight, simply type:

scatter mpg weight

If you want a line graph instead, type:

line mpg weight, sort

The sort option here does not mean Stata should sort the data. Rather it means that the line should be drawn from the observation with the smallest value of weight to the observation with the next smallest, etc. Without it the line would be drawn from observation one to observation two to observation three and so forth, and the result would look like a scribble (try it).

The easiest way to keep track of the many settings and details involved in creating a graph is to use the point-and-click graphical user interface. Stata will translate what you choose into a Stata command which you can rerun, put into a do file, or modify. Start by clicking Graphics, Twoway graph (twoway meaning a graph that has an X and a Y). Then click the Create button to create a new graph.

You'll then get a window where you can choose the basic properties of your graph. Leave the category set to Basic plots, set the type to Line and choose or type mpg as the Y variable and weight as the X variable. Check the box that says Sort on x variable. Then click Accept.

[Screenshot: Creating a line graph]

This will take you back to the main graphics window. You could click Create again to add another graph which would be overlaid on the line graph you already defined. But there are several other tabs that control the properties of the graph.

Select the if/in tab and you can choose which observations are to be included. Type price<10000 in the If: box (note that you don't have to type the word if).

[Screenshot: Selecting observations with if]

Next click on the By tab. Check the box Draw subgraphs for unique values of variables and for Variables choose or type foreign.

[Screenshot: By: options]

Click OK, and the graph will be created. The command for creating it will also be placed in the Results window:

twoway (line mpg weight, sort) if price<10000, by(foreign)

Note how for graphs by is an option, not a prefix like you're used to. That's because you're not creating two completely separate graphs for the domestic and foreign cars like you would with the standard by:. Instead you're creating one graph with the two subpopulations next to each other.
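
The same structure extends to overlaid plots. As a rough sketch, assuming the shipped auto data, the command below puts a scatterplot and a linear fit in each panel; mpgFit is a hypothetical name for the graph.

```stata
sysuse auto, clear
* Overlay a scatterplot and a linear fit, drawn separately for
* domestic and foreign cars; mpgFit is a hypothetical graph name
twoway (scatter mpg weight) (lfit mpg weight) if price<10000, by(foreign) name(mpgFit, replace)
```

Each set of parentheses is one plot, and the if condition and the options after the comma apply to the graph as a whole.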

If you click Graphics, Twoway graph again the same settings will still be there so you can refine the options you chose and try again. Once you've got the graph you want, copy the resulting command into a do file. If you want to start a new graph instead, click on the large R (reset) button in the lower left of the window.

For much more information about creating graphs, see An Introduction to Stata Graphics.

Do Files

You now know how to construct useful commands from the components of Stata syntax. Next it's time to learn how to organize those commands into do files.

Do files are simply text files whose names end with .do and which contain Stata commands exactly the way you'd type them into the command window. Since they are plain text you can use any text editor you prefer, including TextPad, emacs, vi, or even Notepad. Stata includes a simple text editor, very similar to Notepad (but it's also available in Linux). If you need to run do files on a Linux server but don't know any Linux text editors, take a look at Running Linux Programs Using Windows (Mostly).

To make a do file, open a text editor and start typing Stata commands, pressing Enter at the end of each one. Then save it as filename.do. That's it.

Logs

Every do file should have a corresponding log file which records what actually happened when the do file ran. If you run your do file in batch mode, reading the log is the only way you'll get your results. To start logging the command is:

log using filename.log, replace

where filename is the name of the file you want Stata to use as a log. All commands and their output will be saved in that file. The replace option tells Stata that if a log file with that name already exists, say from a previous attempt to run the program, it should be replaced by the current log.

Note that if you do not specify the .log at the end of the filename, Stata will save the log using its Stata Markup and Control Language. SMCL has its uses, but it can only be read by Stata's Viewer. If your filename ends with .log, Stata will save the log as plain text which you can read in any text editor.

When you are done with everything that needs to be recorded, type

log close

Comments

Comments are bits of text included in a do file for the benefit of human readers, not for Stata. When Stata sees the characters /* it will ignore everything that follows until it sees */. Comments should explain what the code is doing and why, and if anyone else ever needs to read and understand your code, good comments are invaluable. But the most likely beneficiary is yourself: in six months (let alone ten years) your code might as well have been written by someone else.

You don't need to comment every line of code--most Stata code is fairly easy to read. But be sure to comment anything that required some particular cleverness on your part.
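
Here is a short sketch of what commented code might look like, assuming the shipped auto data; priceK is a made-up variable name. Besides the /* */ form, Stata also treats a line that begins with an asterisk as a comment.

```stata
/* Load the example data that ships with Stata */
sysuse auto, clear

/* price is in dollars; rescale so the regression
   coefficients are easier to read */
gen priceK=price/1000

* A line that starts with an asterisk is also a comment
regress priceK mpg foreign
```

Note that the comments say why the rescaling is done, not just that it is done.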

Writing a Do File

Let's write an actual do file. Open your favorite text editor--if you don't have one we suggest TextPad on Windows. Save the blank document in a convenient location (perhaps your U: drive) as stataintro.do so your editor will know it's a Stata do file (TextPad and emacs will color it accordingly). Then type something along the lines of:

log using stataintro.log, replace
set mem 5m
use "c:\program files\stata10\auto", clear
/* That's the Windows path--if you're using Linux replace it with /software/stata/auto */

/* Some things you could do with this data--feel free to make up your own */
sort foreign mpg
by foreign: l make mpg if _n==1 | _n==_N
/* rep78 is missing for some cars; without the if they would count as rep78>3 */
gen greatCar=(rep78>3 & mpg>25) if rep78<.
logit greatCar foreign price
log close

Save it when you're done.

Running a Do File

To run your do file, go back to Stata. First you need to change to the proper directory--the one where you saved the do file--using cd. If you put it directly in your U: drive the command would be

cd U:

Then actually run it by typing do and then the filename.

do stataintro

Stata will assume that the filename ends with .do. You'll then see all your results. If the do file doesn't run properly you'll need to make changes and run it again, but read the next section before doing so. Also open the log file in your text editor so you can see what it contains.

Running Do files in Interactive Stata

If you are using Windows Stata or an interactive Stata session in Linux there are some additional commands you'll want to add to the beginning of your do file. First off, you don't want to have to sit there and press the space bar every time the Results window fills up and Stata says --more--. You can prevent that by putting

set more off

at the beginning.

Then you want to make sure that whatever happened before your do file was run doesn't cause problems. You can get rid of any previous data in memory by adding

clear

but there could also be an open log file. One common scenario (in fact you may be experiencing it right now) is where a do file opens a log but crashes before closing it. The log thus remains open. If you fix the problem with the do file and then try to rerun it, it will crash again because it can't open a new log. You can fix that by typing log close, but that only works if a log file is actually open--otherwise your do file crashes again. The solution is to use capture:

capture log close

The capture command allows your do file to proceed even if the following log close command generates an error because no log was open.

These three commands should probably be the first three lines of any do file meant to run in interactive Stata:

set more off
clear
capture log close
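
Putting the pieces together, a do file skeleton might look roughly like this. It is only a sketch: template.log is a made-up file name, and sysuse is used to load the shipped auto data so the example does not depend on where Stata is installed.

```stata
set more off
clear
capture log close
log using template.log, replace

/* Work goes here; sysuse loads the auto data shipped with Stata */
sysuse auto
summarize price mpg

log close
```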

Running Do files in Batch Mode

In Linux you can submit a job to Stata in batch mode. Batch mode Stata doesn't waste CPU time drawing windows or putting results on the screen. It simply starts up, runs your do file, and quits when it is done without any further intervention. You then get the results by opening the log file. To run a do file in batch mode, type:

stata -b do filename

at the Linux command prompt. Note that if you plan to run a do file in batch mode there's no need for the additional commands described in the previous section (though they won't hurt). If your do file will take more than a few minutes to run, consider submitting it to Condor by logging into Kite and typing:

condor_stata -b do filename

The SSCC has a tremendous amount of computing power available through our Condor flock. See An Introduction to Condor for more information.

Windows Stata does not have a batch mode. However, you can prevent Stata from wasting CPU time updating the Results window by putting it in the background.

Organizing Your Research Project

Now we'll go the opposite direction: taking a research project and breaking it into do files. Consistently following a few best practices can save you a tremendous amount of time and headaches, and reduce the probability of making serious mistakes.

In a typical situation you have a research question you want to answer and some data that you think will answer it, but the data isn't in a form that can actually answer the question--yet.

Begin with the End in Mind

The first thing you should do is figure out what form the data must take in order to be useful. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use. But planning it out ahead of time will prevent you from spending time manipulating the data in ways that don't actually end up meeting your needs.

Don't Try to do Everything at Once

Once you've got the goal clear in your mind, the last thing you should do is sit down and write one massive do file that gets you there in one step, only trying to run it once it's "done." First of all this is a recipe for frustration, as the result will most likely be a massive number of bugs. Even worse, you may find that in order to make the early parts work you'll need to do something in a different way than you originally thought. You'll then have to change everything that follows.

It's far better to write a bit of code, test and debug it, then write a little more, test and debug it. But then you end up rerunning the old code you know is good every time you want to test the new code that may be bad. The solution is to break up your project into multiple do files. That way you only need to rerun the part that you're currently working on.

Never Write your Output over your Input

Most do files you'll write will start with some input data file, do things with it, and save the result. However you should never have a do file save its output over its input. If you do, you can never run that do file again because the input it was written to process is now gone. If it turns out that the do file contained an error, you may be reduced to asking the consultant to restore your input data from the SSCC's backup tapes.

Make your Workflow Reproducible

On the other hand, if you plan your workflow properly you can recreate your entire project at will.

Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a separate file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until your project is done. If your data files are large, you can delete all but the original and the input data for the do file you're currently working on. If you follow this procedure you can recreate everything you've done at will just by rerunning all your do files. (It's also a good idea to make a "readme" file for each project with information like what order its do files must be run in.)
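
The chain of files might be sketched as follows. To keep the example runnable in one piece, both stages appear together here, but in a real project each would be its own do file. The shipped auto data stands in for your original data, and stage1 and stage2 are hypothetical file names.

```stata
* Stage 1 (would be its own do file): read the original data,
* clean it, and save the result under a new name
sysuse auto, clear
drop if rep78>=.
save stage1, replace

* Stage 2 (a second do file): read stage 1's output, add variables,
* and save under another new name
use stage1, clear
gen gasGuzzler=(mpg<20)
save stage2, replace
```

Because no stage writes over its own input, any stage can be rerun on its own without starting from scratch.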

This method is also extremely helpful for debugging. If you discover a problem you can go back through your do files one by one until you find the error, fix it, and then rerun the corrected code for your entire project with just a few commands. It's also easy to make changes.

Learning More

Congratulations, you now know enough Stata to get you started. There's a great deal we haven't covered, of course, but Stata has excellent tools for learning more.

Your first resource is the Stata help files, which contain the bulk of the printed documentation. To see the help for a particular command type help command, e.g.

help egen

You'll get a syntax diagram, a brief explanation of the various options, and even examples.

However, you'll very often know what you want to do but not the name of the command that will do it. Then findit is your best bet. For example, suppose you want to do something with Heckman selection models. If you type

findit heckman

you'll get a tremendous amount of information. First Stata will search the help files and point out that there is a heckman command, along with related commands like treatreg. Then it will search the Frequently Asked Questions files on Stata's web site (and the large statistical web site at UCLA). Finally it will search through the user-written programs that have appeared in the Stata Journal, the old Stata Technical Bulletins, or in the Boston College Statistical Software Components archive. You can download and install these programs right from the Stata viewer and use them immediately.

Next, the manuals are excellent. They are available for short-term checkout in the CDE Library or for reference in the 4218 computer lab, and can be purchased through Stata's "GradPlan" at a reasonable price and with great speed. The User's Guide is the first place to look for general concepts, but the Reference books are the place to go for help using specific commands and estimators.

The SSCC's publication collection has a large section on Stata, including general guides like this one, An Introduction to Stata Graphics or Programming in Stata, plus discussions of specific topics like Bootstrapping in Stata or Using Stata Graphs in Documents.

We also offer classes on Stata each semester--see the training web page for details and to register.

Finally, the SSCC consultant is available to help. We cannot write your Stata programs for you. But we will be more than happy to help with planning your project, figuring out the commands that will make your program work, and of course finding and fixing bugs.
