Programming in Stata

Almost as soon as you start writing Stata code, you start looking for ways to write code faster and with fewer errors. One solution is to make one piece of code do more than one thing. While this may make the code a bit more complex and harder to debug, it saves having to write and debug a separate piece of code for each task. This article will teach you how to write this kind of flexible code. We will cover local macros, programs, loops, and a few miscellaneous tools along the way. We'll end by writing our own brand new Stata command (an ado file).

This article includes many examples which are also available as do files (and one ado file). Links to these files are included within the article, but you may want to get them all at once so you are not interrupted as you work. If you are using an SSCC Linux server, the following commands will create a directory called stataprog and save all the files there.

mkdir ~/stataprog
cd ~/stataprog
cp /usr/global/web/sscc/pubs/files/4-15/* .

You can also download all of them from the web by going to this list of files. Note that the files are written assuming they will be run on an SSCC Linux server. In particular, some of them load the auto data set from /software/stata/auto. You will need to change that to the directory where Stata is installed on the computer you are using. For example, on the Winstats you need to change it to "c:\program files\stata9\auto".

Now on to the programming tools.

Local Macros

Local macros are somewhat like variables in programming languages. They are "boxes" where you can store things and pull them out later. This allows you to write code that will do different things depending on the value of the macros at the time it is run.

Macros are easy to define; try typing the following:

local x=1
display `x'

The first line defines a local macro called x and sets it equal to 1. The second displays `x'. It is critical you see that the single quotation marks around the `x' are not the same (how different they look depends on the font). The left quotation mark (`) is found under the tilde (~), usually in the upper left corner of the keyboard. The right quotation mark (') is found under the double quotation mark (") usually in the center-right of the keyboard. You must put the left quote before the macro name and the right quote after, or Stata will complain at you.

While macros can be used like variables, they are not really variables. What really happens is that macros are replaced by the text they contain before Stata interprets the command. So in our example display `x' is exactly the same as typing display 1. All macros are stored as strings, even numbers. In fact we don't even need the equals sign in the macro definition unless we want Stata to do some math first.

local x 1

is the same as

local x=1

and Stata will actually process the first a bit quicker. The following example should show you when to use the equals sign. Note that display "stuff" means to display the stuff in the quotation marks as a string, without evaluating it.

local x 2+2
di `x'
di "`x'"

local x=2+2
di `x'
di "`x'"

In the first case the local macro x contains "2+2", as we could see when we display x as a string. In the second case 2+2 was evaluated to be 4 and then stored in x. Here's a test: What will the following code display (note that ^2 means raise to the second power)?

local x -2
di `x'^2

If you guessed 4, you forgot either the precedence of algebraic operators or how Stata uses macros. `x' is replaced by -2 before Stata does anything with it, so it sees -2^2. But the power takes precedence over the minus sign, so this is the same as -(2^2), not (-2)^2. If `x' were a variable like in other programming languages, the minus sign would not be separate from the 2.
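If you do want the square of the number stored in the macro, one option is to wrap the macro in parentheses so the minus sign stays attached to the 2. A minimal sketch, meant to be run from a do file:

```stata
local x -2
* Stata sees -2^2, which is -(2^2), so this displays -4
display `x'^2
* Stata sees (-2)^2, so this displays 4
display (`x')^2
```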

The nice thing about macros not being variables is that you can put almost anything in them and use them absolutely anywhere. You can even include macros in macro definitions. Try:

local i=`i'+1

Right now, `i' has nothing in it. So what just happened? An undefined macro will be replaced by nothing. So what Stata saw was

local i=+1

which is perfectly legal. In most cases however, using an undefined macro will lead to syntax errors. If you mistype a macro name, Stata will assume you meant some other, currently undefined macro--this can be a particularly difficult error to debug.

Run that command again:

local i=`i'+1

This time `i' does have a value, so Stata sees

local i=1+1

and `i' is set to 2. You could use a command like this to increment a counter.
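As a small sketch of the counter idea, the following starts a counter at zero and increments it twice. The second increment uses the local ++ shorthand, which does the same job as the longer form:

```stata
local i = 0
* Long form: Stata sees local i = 0 + 1
local i = `i' + 1
* Shorthand that increments the macro in place
local ++i
* Displays 2
display `i'
```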

Macros are perfectly legal in file names for log files and data sets. For example if you were creating separate data sets by race and sex, you could just define macros for race and sex and then use them in the save command. If you type

local race Black
local sex Women
save `race'`sex'

then Stata creates a data set called BlackWomen.dta. If the save command followed

local race White
local sex Men

it would create WhiteMen.dta.

You could also use macros as a replacement for copy and paste in your text editor: assign a macro to a short piece of code that is repeated in your program and just type the macro instead of the code. But this generally makes your code much harder to read. For example:

tab `macro1'

could be tabulating anything, and could include if or in conditions and options. In order to know what's going on, you have to find the most recent definition of macro1. On the other hand, if used wisely macros can make your code clearer. The key is to use well-named macros to substitute for logical chunks of code. For example, if you had a big list of control variables that you used constantly, you could define the list as a macro called controls. Then instead of

reg income edu race occupation location... (many more control variables)

you could type

reg income edu `controls'

and be done. Or if you repeatedly deal with subsamples of your data, you could define a macro that gave the conditions for that subsample. For example

reg income edu `controls' if race=="black" & sex=="female"

could also be done as

local blackWomen race=="black" & sex=="female"

reg income edu `controls' if `blackWomen'

You could save a bit of typing by including the if within the macro, since the macro only makes sense when following an if. But leave it out in order to preserve the readability of your code. If you include the if, it's not clear what the macro `blackWomen' does. But if `blackWomen' makes it fairly obvious that the macro just gives the particular conditions that define a black woman in your data set. In both of these cases, the macros (perhaps arguably) make your code clearer by hiding the details of the implementation.
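The income data in the examples above is hypothetical, but the same pattern can be tried on the auto data set that comes with Stata. In this sketch the macro names controls and cheapForeign, the choice of variables, and the price cutoff are all illustrative assumptions:

```stata
sysuse auto, clear
* A named list of control variables
local controls weight length
* A named subsample condition, without the if
local cheapForeign foreign==1 & price<6000
* Stata sees: reg mpg weight length if foreign==1 & price<6000
reg mpg `controls' if `cheapForeign'
```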

To see the above examples in action, run macro.do.

Programs

A program allows you to define a chunk of code in one place and run it repeatedly. You can also pass in parameters which will be stored as macros, then use those macros in various ways within the program.

A Stata program is just some Stata code with the line program define name at the beginning and the line end at the end. It is a tradition that the first program you write in a new language simply display the message "Hello World" (who starts these traditions?) so let's do that. Type:

program define hello
1. di "Hello World"
2. end

Note that Stata provides the line numbers for you; you will not put them in when you write a program in a do file (see the example code at the end). To run your program, type hello. Okay, now you've paid your dues to tradition. More importantly you now understand the mechanics of writing a program. Now any time you want to say "Hello World" all you have to do is type hello. What a time-saver! But even supposing you had a reason to say "Hello World" once, to say it more than once in exactly the same way seems a bit redundant. You need to add some flexibility to your program, and that's where the macros come in.

When running a program, anything typed after the program name will be interpreted as arguments. Arguments for programs work much like arguments to mathematical functions: the program does whatever it does depending on its arguments. Within the program, arguments are referenced by number: `1' is the first thing after the program name, `2' the second, etc. (spaces define where one argument ends and the next begins).

Change your program a bit:

program define hello
1. di "Hello `1'"
2. end

But you never got that far, did you? Instead you got an ugly message saying

hello already defined
r(110);

You can't have two hello programs at once. You need to get rid of the original by typing:

program drop hello

This can be a minor nuisance if you're running Windows Stata or GUI Stata on Linux, where you can't really be sure what has been going on before your do file is executed. If you try to define a program that's already been defined, your .do file will crash with the message you just saw. If you try to drop a program that hasn't been defined (for example if you tried program drop hello twice) your do file will crash again, with the message

hello not found
r(111);

The solution lies in the capture command. When a command is preceded by capture, any errors it generates are ignored (they are captured by capture). So capture program drop hello will get rid of a program called hello if it exists, and do nothing if it doesn't.
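In a do file, the usual pattern is to put the capture program drop line right before the definition, so the do file runs cleanly whether or not the program already exists. A minimal sketch:

```stata
* Drop any earlier definition; capture ignores the error if there is none
capture program drop hello
program define hello
    di "Hello World"
end
* Run the program
hello
```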

Now back to your modified hello program. You see that it now calls upon the local macro `1'. Try running it by typing hello and see what it does. Since we haven't defined `1', it is ignored. More precisely, the `1' is replaced by its value: absolutely nothing. Now try typing

hello Russell

The program responds Hello Russell.

hello Russell Dimond

just displays Hello Russell because Dimond is stored in `2', and our program doesn't do anything with `2'. On the other hand, the macro `0' contains all the arguments, which gives us some additional flexibility. Let's change our hello program one more time:

program drop hello
program define hello
1. di "Hello `0'"
2. end

Now try:

hello
hello Russell
hello Russell Dimond, how are you today?

If you are an excellent typist you may have missed an important lesson, but the rest of us got it: never try to input a program interactively. One mistake and you have to drop the program and start all over again. Always define programs in .do files.
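To compare the numbered macros side by side, here is a small do file sketch. The program name showargs is made up for this illustration:

```stata
capture program drop showargs
program define showargs
    * `1' and `2' hold the first two space-separated arguments
    di "First argument: `1'"
    di "Second argument: `2'"
    * `0' holds everything typed after the program name
    di "All arguments: `0'"
end
showargs Russell Dimond
```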

Finally, if you have ever used a general-purpose programming language (FORTRAN, C/C++, Java, whatever) or if you work with someone who has, be prepared for a bit of confusion about nomenclature. The logical equivalent of a Stata .do file in these languages is a program, while the logical equivalent of a Stata program is (depending on the language) a subroutine, function, procedure, or method (at any rate just a part of a program). You'll probably hear people refer to do files as programs all the time (I do it), and don't be confused if someone starts calling a Stata program a subroutine.

The file program.do will run all these examples.

Loops

Most Stata commands are really loops. Stata carries out the command for the first observation, then the second, and so forth. Take advantage of this looping structure whenever you can, because it is quite fast. But it's not hard to imagine other loops you might want: for example, you might want to execute the same command for five different variables. Stata allows you to do these too--you just have to write them yourself.

foreach

The foreach command allows you to create loops that loop over a list of things.

foreach macroname in/of [list type] list {
code involving `macroname'...
}

macroname is a name we choose to represent the elements in our list. As always, make the name informative. The in/of construct means you will use either in or of (not both), depending on the type of list. You'll use in for generic lists, of for all others. The list type is optional. If it is omitted Stata will interpret what follows as a generic list. Finally there will be the list to be acted on, and then a left curly bracket. Note the placement of the brackets: the first one must be part of the same statement as the foreach (it cannot go on the next line like in C/C++ or Java) and the last one must be its own statement (it cannot go at the end of the last command inside the loop). Everything inside the curly brackets will be executed once for each item in the list, and macroname is a local macro that will contain each item in the list in turn. Let's look at an example:

foreach color in red blue green {
1. di "`color'"
2. }

will give the output

red
blue
green

Note that, as with programs, Stata gives you line numbers when you type a foreach loop interactively but you will not need to type them in do files. Using in with no list type indicates a generic list. Stata makes no attempt to interpret what follows other than break it up into elements. Thus our loop runs three times, once for each element. The first time, `color' is set to the word red, the second time to blue, and the third time to green.

It's very common for your list of items to be stored in a macro that was constructed earlier in the program. You can use such a macro directly in a foreach command:

local colors red blue green
foreach color in `colors' {
1. di "`color'"
2. }

However, this is so common that Stata wrote special code to handle this case more efficiently.

local colors red blue green
foreach color of local colors {
1. di "`color'"
2. }

Note that in changed to of because local is officially a list type, if a rather odd one. Also note that colors is not in quotes in the foreach command. If it were in quotes, the standard macro processor would expand it out to red blue green. Instead, we let the local list type look up what the macro means, which it does very quickly.

Normally list types tell Stata what types of things are in your list. The available types are varlist, newlist, and numlist.

For the next example, we'll use the auto data set that comes installed with Stata. Load it by typing

sysuse auto

(The sysuse command loads a file from whatever directory Stata is installed in--it's only useful for examples.)

The varlist construction specifies that what follows is an official list of variables. That's not quite as important as it sounds, because you can also put variable names in generic lists. But compare the following:

foreach var in price mpg rep78 {
1. di "`var'"
2. sum `var'
3. }

foreach var of varlist price-rep78 {
1. di "`var'"
2. sum `var'
3. }

foreach var in price-rep78 {
1. di "`var'"
2. sum `var'
3. }

In the first case, foreach interpreted the list as three words, each of which the sum command later recognized as variable names. In the second case, foreach was forewarned to expect a variable list, and thus interpreted price-rep78 as a list of three variables. However, in the third case foreach had no such warning and interpreted price-rep78 as a single word. As a result the loop was actually executed just once. It was the sum command that later interpreted price-rep78 as a variable list containing three variables.

newlist is for lists of new variables; variables which do not yet exist but will be created inside the loop. For example:

foreach var of newlist x1 x2 x3 x4 x5 {
1. gen `var'=0
2. }

newlist checks to make sure the list only contains valid variable names, but does not actually create the variables--gen does that.

numlist is for lists of numbers. Compare this with the previous:

foreach i of numlist 1/5 {
1. gen y`i'=0
2. }

Note how the `i' macro acts like a subscript to the y variable. This is a very common construction: population`year', income`wave', etc.
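The subscript construction also works on the right-hand side of a command. As a sketch using the auto data, the loop below creates price1, price2, and price3 holding the first three powers of price. The variable names and the choice of powers are just for illustration:

```stata
sysuse auto, clear
foreach i of numlist 1/3 {
    * `i' is both the suffix of the new variable and the exponent
    gen price`i' = price^`i'
}
sum price1 price2 price3
```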

forvalues

On the other hand, looping over a list of evenly-spaced numbers is the specialty of forvalues, and it will do it faster than foreach. Also, since foreach has to construct the whole list of numbers before it can start, it can only handle relatively small lists. forvalues has no such limit. It's quicker to type too:

forvalues i=1/5 {
1. gen z`i'=0
2. }

forvalues isn't limited to counting upwards by one--type help forvalues for details on other constructions.
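For instance, putting a step size in parentheses between the start and end values counts by that step, and a negative step counts down. A short sketch, written as it would appear in a do file:

```stata
* Count from 10 to 50 in steps of 10
forvalues i = 10(10)50 {
    di `i'
}
* Count down from 5 to 1
forvalues i = 5(-1)1 {
    di `i'
}
```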

Use loop.do to run all these examples.

Stored Results

This section may not be a programming topic, but it is a tool we'll use in our final example. And it's good to know anyway.

Many Stata commands store values in an internal array you can access once you know it's there. Estimation commands create an array called e( ), and you can see what's in it by typing ereturn list. Almost all other commands that return results put them in an array called r( ), and you can see what's in r( ) by typing return list. The manuals also describe what each command returns. The only trick is that every command that uses the e( ) or r( ) arrays overwrites the previous values. So if you want to do anything with the results of a command, you must do it before you issue another command that returns values. One option is to save the results in variables or local macros for later use. Try the following:

reg price weight foreign mpg
ereturn list
sum weight
return list

If you want to demean weight (subtract the mean from all observations), all you have to do is type

replace weight = weight - r(mean)

Try it and then do

sum weight

again to see the results. Note that there are issues with numerical precision, but you've accomplished your purpose. Keep in mind that you have also replaced the old values of the r( ) array with a new set of values referring to the second time you ran summarize. Good thing you were done with the old results.
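If you are not done with a result, copy it into a local macro before running the next command that returns values. A minimal sketch, where the macro name weightMean is just an illustrative choice:

```stata
sysuse auto, clear
sum weight
* Save the mean before another command overwrites r()
local weightMean = r(mean)
* This second summarize replaces r(mean) with the mean of price
sum price
* The saved copy is still the mean of weight
di `weightMean'
```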

To see this in practice, take a look at results.do.

A Program to Demean Data

Let's put together everything you've learned by writing a program that demeans data. This is a simple enough task that a program isn't really needed, but we'll go a step further and make it both flexible and error-resistant. In other words, we'll put a lot more effort into it than it's worth (except as a learning experience, of course).

We'll start with the simplest possible version (which is generally a good idea when programming). It will take one argument, a variable name, and demean that variable.

program define demean
1. sum `1',meanonly
2. replace `1'=`1'-r(mean)
3. end

Try it out and see how it does (just reload the auto data set if you start running out of variables with non-zero means).

That's fine as far as it goes. But suppose you wanted to demean 20 different variables? It's time to add a foreach loop. Recall that local macro `0' (zero) contains all the arguments passed in to the program. We could have our foreach loop work with this as a variable list, or even a generic list. But local was created for exactly this kind of situation and will run a bit faster. So the next version is:

program drop demean

program define demean
1. foreach var of local 0 {
2. sum `var',meanonly
3. replace `var'=`var'-r(mean)
4. }
5. end

There's just one problem with your demean program. To see it type demean make. The make variable is a string. It has no mean, and so your program crashes. Now, you may be thinking that anyone who tries to demean a string deserves what's coming to them, but let's fix it anyway, just so you can learn how. You may not be able to demean a string, but you can give a better error message, and then proceed to demean any other variables that were requested and are valid.

If as a Way to Control Program Flow

You're used to using if at the end of commands. That meant "execute the preceding command for a given observation only if this condition is true for that observation." What you're going to do now is very different. You're going to say "only execute the following commands for ANY observation if this condition is true." The condition itself is also different: it is a scalar. It is evaluated just once, not once for each observation. If the condition includes a variable, the value of that variable for the first observation will be used. It is also possible to combine if with else, so you can make arbitrarily complex sets of conditions. The syntax looks like this (this is a fairly complex example so you can see how all the pieces work--we'll do something simpler in our program):

if condition1 {
commands to execute if condition1 is true...
}
else if condition2 {
commands to execute if condition1 is false and condition2 is true...
}
else {
commands to execute if both condition1 and condition2 are false...
}

Note how the brackets have to be placed just like with foreach.
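As a concrete sketch of the same structure, the following uses the auto data and tests a stored result, which is a single number, rather than a variable. The cutoffs of 10000 and 5000 are arbitrary choices for illustration:

```stata
sysuse auto, clear
sum price, meanonly
* r(mean) is one number, so each condition is evaluated just once
if r(mean) > 10000 {
    di "Mean price is above 10,000"
}
else if r(mean) > 5000 {
    di "Mean price is between 5,000 and 10,000"
}
else {
    di "Mean price is 5,000 or less"
}
```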

The problem with your program is that as soon as Stata sees you try to subtract something from a string variable, it crashes with the message

type mismatch
r(109);

before it even looks at any observations. So your job is to detect strings before you try to demean them, and only subtract things that can be subtracted. You can do this using the confirm command. It's a bit like assert in that you use it to check on things you believe to be true, but it's designed for programmers. Thus it allows you to check things like that a file actually exists, or in this case, that a variable is numeric and thus has a mean. The syntax is

confirm numeric variable var

where var is the variable you're checking. It will do nothing if the variable is numeric, and cause an error if it is not. But you don't want it to crash the program, so put capture in front of it.

But how will you know the result if you use capture? When a command is run under capture, Stata stores its return code in a built-in variable called _rc. A return code of zero means the command was successful. Any other value means something went wrong (different errors give different return codes). So all you have to do is check the value of _rc with an if statement. If _rc is zero, you know the variable is numeric and you can demean it. If not, you give an error message but the program continues to run and processes the rest of the variables.
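To watch _rc change, try the check by hand on one numeric variable and one string variable from the auto data. This is only a quick sketch of the mechanism:

```stata
sysuse auto, clear
* price is numeric, so confirm succeeds and _rc is 0
capture confirm numeric variable price
di _rc
* make is a string, so confirm fails and _rc is nonzero
capture confirm numeric variable make
di _rc
```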

program drop demean

program define demean
1. foreach var of local 0 {
2. capture confirm numeric variable `var'
3. if _rc==0 {
4. sum `var',meanonly
5. replace `var'=`var'-r(mean)
6. }
7. else di "`var' is not a numeric variable and cannot be demeaned."
8. }
9. end

The file demean.do contains and demonstrates all the various versions of the demean program. You'll also notice some comments and a great deal of indenting to make the logical structure easy to see. Both practices are highly recommended.

ado (Automatic Do) files

You now have a nice little program that could be useful in a variety of settings. But you have to run the code that defines it before you can use it. What if you could make it act like any other Stata command and run as soon as you type it? You can, by making it an ado (automatic do) file.

An ado file is just like a do file that defines a program, but the filename ends with .ado and it is stored in one of several ado directories. When you type a command, Stata checks the ado directories to see if there is an ado file with that name. If there is, Stata automatically runs the ado file that defines the program and then executes it. Thus from the user's perspective, using an ado file is just like using a built-in Stata command. In fact many Stata commands are actually implemented as ado files.

In order to create an ado file, you need to isolate the demean program in a separate file and save it as demean.ado in your personal ado directory. You can identify your personal ado directory by typing sysdir. On the SSCC's Linux servers, it is ~/ado/personal (recall that ~ means your home directory). On the Winstats it is w:\ado\personal.
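As a sketch, the contents of demean.ado could be nothing more than the final version of the program, without the line numbers and without a program drop line. The indentation and comments here are a matter of style:

```stata
* demean.ado: subtract the mean from each numeric variable listed
program define demean
    foreach var of local 0 {
        * Only numeric variables have a mean
        capture confirm numeric variable `var'
        if _rc==0 {
            sum `var', meanonly
            replace `var' = `var' - r(mean)
        }
        else di "`var' is not a numeric variable and cannot be demeaned."
    }
end
```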

Once that's done, demean.ado will almost be like an official Stata command. Not quite though: note that we made no provision for standard Stata syntax like by: or if. Doing so isn't actually as hard as you might think, but still beyond the scope of this article.

You've now learned a powerful set of tools that can save you a great deal of time and trouble. At first you may need to consciously look for opportunities to use them. But they will soon become second nature, and writing code without them will seem unbearably tedious. Consider that progress.

Last Revised: 9/11/2007