Stata Programming Essentials

Ever needed to do the same thing to ten different variables and wished that you didn't have to write it out ten times? If so, then this article is for you. If not, someday you will—so you might as well keep reading anyway.

Stata has all the tools required to write very sophisticated programs, but knowing just a few of them allows you to make everyday do files shorter and more efficient. This article will focus on those programming tools that, in our experience, anyone who uses Stata heavily will eventually want to learn. To benefit from this article you'll need a solid understanding of basic Stata syntax, such as you can get from our Stata for Researchers series. The primary intended audience is Stata users with no other programming experience. If you've done a lot of Stata programming already and are looking to expand your "bag of tricks" check out Stata Programming Tools.

This article is best read at the computer with Stata running. Typing the commands in the examples yourself will help you notice and retain all the details, and prepare you to write your own code.

Macros

A Stata macro is a box you put text in. You then use what's in the box in subsequent commands. (The real trick is getting a single command to run multiple times with a different bit of text in the box each time--we'll get there).

The macros we'll use are "local" macros. If you're familiar with global and local variables from other languages, Stata's local macros are local in the same way. If not, just trust us that local macros are the right ones to use.

The command to define a local macro is:

local name contents

For example:

local x 1

This creates a local macro called x and puts the character '1' in it (not the value 1 as in "one unit to the right of zero on the number line"). To use a macro, you put its name in a command, surrounded by a particular set of quotation marks:

display `x'

The quote before the x is the left single quote. It is found in the upper left corner of the keyboard, under the tilde (~). The quote after the x is the right single quote. It is found under the double quotation mark (") on the right side of the keyboard.

Macros are handled by a macro processor that examines commands before passing them to Stata proper. When it sees a macro (denoted by that particular set of quotation marks) it replaces the macro with its contents. Thus what Stata proper saw was:

display 1

Now try a slightly more complicated macro:

local x 2+2
display `x'

The result is 4, but that's because the display command acts like a calculator. The command Stata saw was:

display 2+2

so it evaluated 2+2 and gave you the answer. If you want display to put something on the screen without evaluating it, put it in quotes. Then display will treat it like a string.

display "`x'"

gives the result 2+2. But consider what happened before you put it in quotes: your macro contained a working bit of Stata code which Stata happily executed when you used it. In fact Stata proper didn't know or care that 2+2 came from a macro. This feature allows you to use macros absolutely anywhere, even in macro definitions.
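
As a small sketch of that last point, the following defines one macro in terms of another. The macro names a and b are made up for this illustration.

```stata
* a holds the text 2
local a 2
* b is defined using a; the macro processor turns this into: local b 2+2
local b `a'+`a'
* In quotes, display shows the text stored in b
display "`b'"
* Without quotes, display evaluates the expression
display `b'
```

The first display should print 2+2 and the second 4, just as with the macro x above.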

Storing Results in Macros

If you want to put the result of a calculation in a macro, put an equals sign after the macro name:

local x=2+2
display "`x'"

If the local command contains an equals sign, Stata will evaluate what follows before putting it in the macro. Now x really does contain 4 and not 2+2 no matter how you display it.
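
To see the difference side by side, here is a short sketch using two macros whose names (noEquals and withEquals) are chosen just for this example:

```stata
* Without an equals sign the text itself is stored
local noEquals 10/4
* With an equals sign the expression is evaluated first
local withEquals = 10/4
display "`noEquals'"
display "`withEquals'"
```

The first display prints 10/4 and the second prints 2.5.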

Macro Expressions

Stata's macro processor can evaluate Stata expressions; i.e. any formula you could put after the equals sign in a generate or replace command (but not egen). The syntax is:

`=expression'

where expression is the expression to be evaluated. Try:

display "`=2+2'"

The result is 4, but display didn't calculate it (the quotes prevent that). Instead, the equals sign before 2+2 told the macro processor to evaluate that expression and put the result in the code, so what Stata proper saw was display "4". Another common use is `=_N', which will be the number of observations in the current data set (and can be used in places where _N by itself can't).

Macro expressions--and macros in general--can contain other macros. Try:

display "`=`x'-1'"

This tells the macro processor to subtract one from the value of the macro x and then place the result in the code. This can be extremely useful: for example, if you had a macro `year' containing the current year, `=`year'-1' would be the year before the current year.
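
A minimal sketch of both uses follows. The value 2012 is just an example, and the second part assumes you don't mind replacing the data in memory with the auto data set that comes with Stata.

```stata
* year is defined here purely for illustration
local year 2012
display "This year is `year' and last year was `=`year'-1'"

* The number of observations, placed inside a string
sysuse auto, clear
display "The data set has `=_N' observations"
```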

Undefined Macros

Unfortunately, using a macro you haven't defined doesn't generate an error message. Stata's macro processor just replaces it with nothing:

display `y'

gives the same result as:

display

This can cause headaches: if you mistype a macro's name you'll probably get a generic syntax error with no indication that a macro is the cause of the problem. Even worse, in some circumstances the command will still work but give incorrect results. Be very careful to type the names of macros properly.
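
As a sketch of the second, more dangerous case, consider the following, which uses the auto data set that comes with Stata. The macro name controls and the check before the final regression are illustrative choices, not anything Stata requires.

```stata
sysuse auto, clear
local controls weight length

* Typo: the macro control (no s) is undefined, so Stata proper sees
* "regress price mpg" and runs it without complaint
regress price mpg `control'

* One possible safeguard: stop with an error if the macro is empty
if "`controls'" == "" {
    display as error "macro controls is empty"
    error 198
}
regress price mpg `controls'
```

The first regression runs but silently omits the controls. A check like the one shown only catches a macro that was never defined or is empty; it will not catch a misspelled reference elsewhere in your code.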

Some Uses for Macros Outside of Loops

The main reason for learning about macros is so you can use them in loops. But there are times when using them all by themselves can make complex code easier to read.

Suppose you need to run a large number of regressions of various types, but they all include a fixed set of control variables. Consider putting the list of control variables in a macro:

local controlVars age sex occupation location maritalStatus hasChildren

This will make the regression commands shorter:

reg income education `controlVars'
logit employed education `controlVars'

Now suppose you frequently work with subsamples of your data set. You can define macros for them as well:

local blackWoman race==1 & female
local hispMan race==2 & !female
reg income education `controlVars' if `blackWoman'
logit employed education `controlVars' if `hispMan'

The point here is not to save keystrokes, but to make the code more clear. Using macros hides the details of what the control variables are or how a black woman can be identified in this data set and helps you focus on what you're trying to do. Not having to type out those details every time also removes an opportunity for error. You can make changes more quickly too: if you need to add a control variable you only have to add it to the definition of the controlVars macro rather than adding it to each regression command.

Saving keystrokes is a nice side effect, but resist the temptation to make your code less clear in the name of making it shorter. Taking a few minutes to type out clear code is far more efficient than spending hours debugging code that's short but hard to understand.
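
The examples above use hypothetical variables. A runnable sketch of the same idea, using the auto data set that comes with Stata, might look like this (the macro names and the choice of variables are ours):

```stata
sysuse auto, clear
* A fixed set of controls and a subsample condition, each defined once
local controlVars weight length
local foreignCars foreign==1

regress price mpg `controlVars'
regress price mpg `controlVars' if `foreignCars'
```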

Foreach Loops

A foreach loop takes a list and then executes a command or set of commands for each element of the list. The element currently being worked on is stored in a macro so you can refer to it in the commands. The list to be looped over can be a generic list containing text, or one of several kinds of structured lists (of which we'll mostly discuss varlists).

The syntax for a foreach loop with a generic list is:

foreach macro in list {
    command(s)
}

As a very simple example:

foreach color in red blue green {
    display "`color'"
}

Here, color is the name of the macro that will contain the list elements. red blue green is the list itself. Stata breaks the list into elements wherever it sees spaces, so this list contains three elements: red, blue, and green. The left curly bracket ({) marks the beginning of the loop and must be at the end of the foreach command. The right curly bracket (}) marks the end of the loop and must go on its own line. If you type this in interactive Stata the Results window adds line numbers for the commands inside the loop, but you do not need to type them. Note how nothing is actually executed until you type the right curly bracket, and then Stata runs the whole thing. When it does you'll get the following output:

red
blue
green

Stata begins by analyzing your list and identifying the elements it contains. It then puts the first element (red) in the loop's macro (color) and executes the command in the loop. Given the contents of color, the command becomes display "red" and red is printed on the screen. Stata then puts the second element in the macro and runs the command again, printing blue on the screen. It then repeats the process for green, and when that's done Stata realizes the list is out of elements and the foreach loop is complete.

Throughout this article you'll see that commands which are inside a loop are indented. This makes the loop's structure visually obvious and we highly recommend you do the same when writing do files. All you need to do is press Tab before you begin the first line of the loop. Stata's do file editor and any other text editor suitable for programming will indent subsequent lines automatically. (There's no need to worry about indenting when working interactively, but in real work it's very rare to use loops interactively.)

You can use a generic list to loop over many different kinds of things: variables, values, files, subsamples, subscripts, anything you can describe using text. If an element needs to contain spaces, put it in quotes.
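
As a brief sketch of quoting elements that contain spaces (the names in the list are invented for the example):

```stata
* Two of the three elements contain spaces, so they are quoted
foreach name in "Jane Doe" "John Q. Public" Cher {
    display "`name'"
}
```

The loop should run three times, once per element, rather than once per word.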

Looping over Variables

The most common thing to loop over is variables. For example, suppose you wanted to regress several different dependent variables on the same independent variables. The following code does so, using the automobile example data set that comes with Stata:

sysuse auto
foreach yvar in mpg price displacement {
    reg `yvar' foreign weight
}

Looping over Parts of Variable Names

Consider the following data set:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_prog/months.dta

It contains a fictitious (and not terribly plausible) data set of people and their incomes over twelve months. This is panel data in the wide form, so there are twelve income variables: incJan, incFeb, incMar, etc. Suppose you want to create a corresponding set of indicator variables for whether the person had any income in that month. Creating one of them is straightforward:

gen hadIncJan=(incJan>0) if incJan<.

but creating all twelve in the same way would be tedious.

(If you checked, you'd find that this data set does not have any missing values so excluding them with if incJan<. is not strictly necessary. Consider it a reminder to always think about missing values when creating such indicator variables.)

You can create all twelve indicator variables quickly and easily with a foreach loop:

foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
    gen hadInc`month'=(inc`month'>0) if inc`month'<.
}

This sets up a generic list containing the months, and then uses those months as parts of variable names.

Note the process we used to create this loop: first we figured out the command we'd use for a single element of the list and then changed it to use macros. This is a good habit whenever you need to write non-trivial code involving macros.
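
One way to check the second step, offered here as a suggestion rather than a required part of the process, is to have the loop display each command instead of running it. The shortened list of months is just to keep the output brief.

```stata
* Preview the commands the loop would generate before running them for real
foreach month in Jan Feb Mar {
    display "gen hadInc`month'=(inc`month'>0) if inc`month'<."
}
```

If the commands printed on the screen look like the one you worked out by hand, remove the display and the quotes and run the loop for real.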

Looping over Varlists

While generic lists can contain variable names, you have to type out all the names individually. If you tell Stata that the list you want to loop over is an official Stata varlist you can use standard varlist shortcuts, like x* for all variables that begin with x and x-z for all the variables from x to z. To review varlist syntax, see the appropriate section in Stata for Researchers.

The syntax for a foreach loop over a varlist is as follows:

foreach macro of varlist vars {

Note that while the foreach syntax for a generic list contains in, the syntax for a structured list has of. Stata uses the in or of to determine whether the next word is the first element of the list or a type of list.
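
A short sketch of both shortcuts, using the auto data set that comes with Stata (the choice of variables is arbitrary):

```stata
sysuse auto, clear
* m* matches every variable whose name starts with m
foreach var of varlist m* {
    display "`var'"
}
* price-weight matches price, weight, and every variable between them
foreach var of varlist price-weight {
    summarize `var'
}
```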

Researchers occasionally receive data sets created in other programs where the variable names are in upper case letters. Since Stata actually cares about case, upper case variable names can be tiresome to work with. Stata recently gave the rename command the ability to convert names to lower case:

rename *, lower

But this is such a great example that let's do it with a foreach loop over a varlist anyway:

foreach oldname of varlist * {
    local newname=lower("`oldname'")
    rename `oldname' `newname'
}

The asterisk (*) all by itself matches all variables, so the list foreach is to loop over contains all the variables in the current data set. The lower() function takes a string, in this case the contents of the macro oldname, and converts it to lower case. Note the use of the equals sign in the local command that defines newname, so that lower("`oldname'") is evaluated and the result is stored.

Looping over Numbers

A forvalues loop (frequently abbreviated forval) loops over numbers. Rather than defining a list, you define a range of numbers.

By far the most common range consists of a starting number and an ending number, and Stata assumes it should count by ones between them. The syntax is simply:

forvalues macro=start/end {

For example:

forvalues i=1/5 {
    display `i'
}

gives the output:

1
2
3
4
5

If you need to count in a different way, type help forvalues to see more options.
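
For instance, one of those options is a range with a step size, written start(step)end. A brief sketch:

```stata
* Count from 0 to 20 in steps of 5
forvalues i=0(5)20 {
    display `i'
}
* Count down from 10 to 2 in steps of 2
forvalues i=10(-2)2 {
    display `i'
}
```

The first loop displays 0, 5, 10, 15 and 20; the second displays 10, 8, 6, 4 and 2.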

Consider the following data set:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_prog/years.dta

This data set is very similar to the data set of monthly incomes we examined earlier, but it contains yearly incomes from 1990 to 2010. Your task is again to create an indicator for whether a person had any income in a given year. Using forvalues this is very easy to do:

forvalues year=1990/2010 {
    gen hadInc`year'=(inc`year'>0) if inc`year'<.
}

This would be more difficult if the years did not include the century (i.e. 90 instead of 1990) because Stata thinks 100 should come after 99 and not 00. If your data include such years, consider adding the century before doing any serious work with it.
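
Here is one possible sketch of adding the century. It builds a toy data set with hypothetical variable names (inc98, inc99, inc00, inc01) so that it can be run on its own; the names and the range of years are assumptions for illustration only.

```stata
* Toy data with hypothetical two-digit-year names
clear
set obs 3
foreach yy in 98 99 00 01 {
    gen inc`yy' = runiform()
}
* 1900s: inc98 and inc99 become inc1998 and inc1999
forvalues yy=98/99 {
    rename inc`yy' inc19`yy'
}
* 2000s: forvalues yields 0 and 1, so pad to two digits first
forvalues yy=0/1 {
    local yy2 = string(`yy', "%02.0f")
    rename inc`yy2' inc20`yy2'
}
describe
```

Once the names carry four-digit years, a single forvalues loop from 1998 to 2001 covers them all.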

Looping over Values and levelsof

Sometimes you need to loop over the values a particular variable takes on. Consider the following data set:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_prog/vals.dta

This contains data on the race, income, age and education category of a set of fictional people. Suppose you want to regress income on age and education, but believe that the effects of age and education may be different for people of different races. One approach (probably not the best one) would be to run a separate regression for the people of each race. Normally you could do that with:

by race: regress income age i.education

(The construction i.education tells Stata that education is a factor or categorical variable and should be converted into a set of indicators. See the section on factor variables in Stata for Researchers if you'd like to review factor variable syntax.)

However, this is fictional survey data and you need to correct for the survey design in running regressions. If you're not familiar with Stata's survey commands, that means the following:

  1. The survey design is described using the svyset (survey set) command. This data set has primary sampling units given by the variable psu and probability weights given by the variable weight. The corresponding svyset command (which has already been run so you don't need to run it) is:
    svyset psu [pweight=weight]
  2. To have Stata correct for those weights in estimation commands, add the svy: prefix, for example:
    svy: regress income age i.education
  3. You can't use the standard if syntax with survey data or the standard errors may not be calculated correctly. Instead, use the subpop() option of svy:, for example:
    svy, subpop(if race==1): regress income age i.education
  4. by: can't be used with svy:

Point #4 means you can't run your regression for all races using by:, but you can do it with a loop. All by: does is identify the values of race and then loop over them, and at this point you know how to do that yourself (though by: is faster when you can use it). The race variable takes on the values one, two and three, so an appropriate loop is:

forvalues race=1/3 {
    svy, subpop(if race==`race'): reg income age i.education
}

What if you had a fourth race, and its number were nine ("Other") rather than four? You could simply recode it and make it four. But if that's not a good idea for your project, you'll have to switch to the less structured foreach loop:

foreach race in 1 2 3 9 {
    svy, subpop(if race==`race'): reg income age i.education
}

On the other hand, it's not unusual to have to loop over dozens or even hundreds of values, or not to know ahead of time what values a variable takes on. In that case you can let the levelsof command identify them for you and put them in a macro. The syntax is:

levelsof variable, local(macro)

For example,

levelsof race, local(races)

will list all the values of the variable race and store them in a macro called races. You can then loop over all of them with:

foreach race in `races' {
    svy, subpop(if race==`race'): reg income age i.education
}

However, this situation is common enough that Stata wrote special code for parsing macros into lists for looping. The syntax is:

foreach race of local races {
    svy, subpop(if race==`race'): reg income age i.education
}

Note that races is not in the usual macro quotes: the whole point of this construction is to bypass the regular macro processor in favor of code that's faster in the context of loops. It makes a very small difference, but if you do enough looping it will add up.
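
If you'd like to try levelsof and foreach ... of local without the survey data, the following sketch does the same kind of thing with the auto data set that comes with Stata. The macro names repairs and r are ours.

```stata
sysuse auto, clear
* Store the distinct non-missing values of rep78 in the macro repairs
levelsof rep78, local(repairs)
* Loop over those values; note repairs is not in macro quotes here
foreach r of local repairs {
    display _newline "rep78=`r'"
    summarize price if rep78==`r'
}
```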

One feature you'll miss from by: is the text in the output telling you which by group is currently being worked on, but you can add it yourself. The following version of the loop adds a display command that inserts two blank lines and then prints the current value of the race macro before running the regression:

foreach race of local races {
    display _newline(2) "Race=`race'"
    svy, subpop(if race==`race'): reg income age i.education
}

Using display to print out the value of a macro at a given point in your program is also a very useful tool for debugging.

Keep in mind that this was just an example. A better way to examine the effect of race would probably be to interact race with the other variables. The new syntax for factor variables and interactions makes this very easy:

svy: regress income i.race##(c.age i.education)

This model contains all the previous models--if you're new to regressions that include interactions, figuring out why that is might be a good exercise.
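
As a starting point for that exercise, here is a sketch using the auto data set rather than the survey data (so no svy: prefix is involved, and the variables are stand-ins we chose). It fits separate regressions for domestic and foreign cars and then a single interacted model.

```stata
sysuse auto, clear
* Separate models for each group
regress price weight if foreign==0
regress price weight if foreign==1
* One model with a full set of interactions
regress price i.foreign##c.weight
* Slope of weight for foreign cars: main effect plus interaction
lincom weight + 1.foreign#c.weight
```

The coefficient on weight in the interacted model should match the slope from the domestic-only regression, and the lincom result should match the slope from the foreign-only regression. The standard errors will generally differ because the interacted model pools the residual variance.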

Nested Loops

The commands contained in a loop can include other loops:

forval i=1/3 {
    forval j=1/3 {
        display "`i',`j'"
    }
}

This code creates the following output:

1,1
1,2
1,3
2,1
2,2
2,3
3,1
3,2
3,3

The inner loop (the one that uses j) is executed three times, once for each value of i. Thus the display command runs a total of nine times. Note how the display command is indented twice: once because it is part of the i loop and once because it is part of the j loop. When you start working with nested loops it's even more important that you can easily tell what each loop contains.

Consider one final data set:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_prog/monthyear.dta

This contains monthly income data, but for the period 1990-2010. The variable names are in the form incJan1990, incFeb1990, etc. To generate a set of corresponding indicators you need to loop over both the months and the years:

forval year=1990/2010 {
    foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
        gen hadInc`month'`year'=(inc`month'`year'>0) if inc`month'`year'<.
    }
}

This is certainly workable, but somewhat cumbersome. It would be especially awkward if you were interested in lags, leads, or changes over time: you'd need code to tell Stata that the month before January 1991 is December 1990. For most purposes it's easier if time periods are simply numbered sequentially. In this case January 1990 would be period 1, December 1990 would be period 12 and January 1991 period 13. Fortunately it's fairly easy to switch:

local period 1
forval year=1990/2010 {
    foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
        rename inc`month'`year' inc`period'
        rename hadInc`month'`year' hadInc`period'
        local period=`period'+1
    }
}

The macro period is used as a counter. It starts out set to 1, and thus as the nested loops begin incJan1990 is renamed inc1 (and similarly hadIncJan1990 to hadInc1). The command local period=`period'+1 increases period by one: once the macro processor is done with it Stata proper sees local period=1+1. That completes the first pass through the inner loop, so month is changed to Feb, and incFeb1990 is renamed to inc2. The period macro is increased again (Stata proper now sees local period=2+1), month is set to Mar, incMar1990 is renamed to inc3, and so forth until all 252 months are converted. (Note that 1990 to 2010 inclusive is 21 years.)
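
To watch the counter at work without renaming anything, you could run a cut-down sketch like this one, which only displays the mapping for a few months of two years (the shortened ranges are just to keep the output brief):

```stata
local period 1
forval year=1990/1991 {
    foreach month in Jan Feb Mar {
        * Show which old name would map to which new name
        display "inc`month'`year' -> inc`period'"
        local period=`period'+1
    }
}
```

This should print six lines, from incJan1990 -> inc1 to incMar1991 -> inc6. Stata also accepts local ++period as a shorthand for the increment.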

In making this conversion you lose the ability to look at a variable and know immediately what calendar month it describes. But it's much easier to loop over. The nested loops can be replaced with:

forvalues period=1/252 {
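
If you later need to recover the calendar month from a period number, the arithmetic is straightforward. The following is one possible sketch, assuming period 1 is January 1990 as above; the macro names months, monthNum and month are ours.

```stata
local months Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
local period 13
* Each block of 12 periods is one year; period 1 is January 1990
local year = 1990 + floor((`period'-1)/12)
local monthNum = mod(`period'-1, 12) + 1
* Pick the word at position monthNum out of the list of months
local month : word `monthNum' of `months'
display "Period `period' is `month' `year'"
```

This displays Period 13 is Jan 1991, which agrees with the numbering described earlier.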

The Importance of Naming Conventions

The variable name incJan1990 contains three components: the thing being observed (income) and the month and year in which it is observed. The loops we wrote depend on the variable names describing all three in a consistent way: they would fail if the data set contained incJan1990 along with incomeJan1991, incjan1992, incJanuary1993 or incJan94. In the real world such things are not unusual. Data sets from surveys are a particular challenge because their variable names often come from the form of the questionnaire rather than the information they contain. Taking the time to rename your variables in a way that makes sense to you is a good idea at the beginning of any project, but if you'll be using loops it's vital that you create and apply a consistent naming convention for variables.
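
Before running a loop that depends on a naming convention, it can help to check that every expected variable actually exists. The sketch below builds a tiny data set with one deliberately inconsistent name (incomeMar, a hypothetical example) and uses confirm variable to find the gap.

```stata
* Toy data: two names follow the convention, one does not
clear
set obs 1
gen incJan = 1
gen incFeb = 1
gen incomeMar = 1
foreach month in Jan Feb Mar {
    * exact prevents variable abbreviations from matching
    capture confirm variable inc`month', exact
    if _rc {
        display as error "inc`month' not found"
    }
}
```

Here the loop reports that incMar is not found, which points you to the variable that needs renaming.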

Take Advantage of Stata's Automatic Loops

Now that you've learned how to use loops, it can be tempting to use them for everything. Keep in mind that most Stata commands are already loops (do something to observation one, then do it to observation two, etc.) and those loops are much faster than any foreach or forvalues loop. For example, the following uses forvalues to loop over all the observations in the data set and set the value of y for each observation to the value of x for that observation:

gen y=.
forvalues i=1/`=_N' {
    replace y=x[`i'] if _n==`i'
}

but you'll get the exact same result far more quickly and easily with:

gen y=x

Occasionally someone finds a task that really does require explicit looping over observations, but it's rare.

Clever programming can sometimes turn other loops into the standard loop over observations, making foreach or forvalues unnecessary. For example, reshaping wide form panel data into long form will eliminate the need for many loops.

Go back to the original 12 months of income data:

use http://www.ssc.wisc.edu/sscc/pubs/files/stata_prog/months.dta

Recall that we created hadInc indicator variables with the following loop:

foreach month in Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec {
    gen hadInc`month'=(inc`month'>0) if inc`month'<.
}

However, you'll get the same results with the following:

reshape long inc, i(id) j(month) string
gen hadInc=(inc>0) if inc<.
reshape wide inc hadInc, i(id) j(month) string

(Take a moment to examine the data after each step.)
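
If you don't have the months data set handy, the same three steps can be tried on a tiny made-up data set. The values below are invented, and list is added after each step so you can see the structure change.

```stata
* Two people, two months, in wide form
clear
input id incJan incFeb
1 100 0
2 0 250
end
* Long form: one row per person per month
reshape long inc, i(id) j(month) string
gen hadInc=(inc>0) if inc<.
list
* Back to wide form, now with hadIncJan and hadIncFeb
reshape wide inc hadInc, i(id) j(month) string
list
```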

Reshaping a large data set is time consuming, so don't switch between wide form and long form lightly. But if you can identify a block of things you need to do that would be easier to do in long form, it may be worth reshaping at the beginning and end of that block.

Last Revised: 2/27/2012