Almost as soon as you start writing Stata code, you start looking for ways
to write code faster and with fewer errors. One solution is to make one piece
of code do more than one thing. While this may make the code a bit more complex
and harder to debug, it saves having to write and debug a separate piece of
code for each task. This publication will teach you how to write this kind
of flexible code. We will cover local macros, programs, loops, and a few miscellaneous
tools along the way. We'll end by writing our own brand new Stata command
(an ado file).
This publication includes many examples which are also available as do files
(and one ado file). Links to these files are included within the publication,
but you may want to get them all at once so you are not interrupted as you
work. If you are using an SSCC Linux server, the following commands will create
a directory called stataprog and save all the files there.
mkdir ~/stataprog
cd ~/stataprog
cp /usr/global/web/sscc/pubs/files/4-15/* .
You can also download all of them from the web by going to this list
of files. Note that the files are written assuming they will
be run on an SSCC Linux server. In particular, some of them load
the auto data set from /software/stata/auto.
You will need to change that to the directory where Stata is installed
on the computer you are using. For example, on the Winstats you
need to change it to "c:\program files\stata9\auto".
Now on to the programming tools.
Local Macros
Local macros are somewhat like variables in programming languages. They are
"boxes" where you can store things and pull them out later. This
allows you to write code that will do different things depending on the value
of the macros at the time it is run.
Macros are easy to define; try typing the following:
local x=1
display `x'
The first line defines a local macro called x
and sets it equal to 1. The second displays `x'.
It is critical you see that the single quotation marks around the `x'
are not the same (how different they look depends on the font). The left quotation
mark (`) is found under the tilde (~), usually
in the upper left corner of the keyboard. The right quotation mark (')
is found under the double quotation mark (") usually in the center-right
of the keyboard. You must put the left quote before the macro name and the
right quote after, or Stata will complain at you.
While macros can be used like variables, they are not really variables. What
really happens is that macros are replaced by the text they contain before
Stata interprets the command. So in our example display
`x' is exactly the same as typing display 1.
All macros are stored as strings, even numbers. In fact we don't even need
the equals sign in the macro definition unless we want Stata to do some math
first.
local x 1
is the same as
local x=1
and Stata will actually process the first a bit quicker. The following example
should show you when to use the equals sign. Note that display
"stuff" means to display the stuff in the quotation marks as
a string, without evaluating it.
local x 2+2
di `x'
di "`x'"
local x=2+2
di `x'
di "`x'"
In the first case the local macro x contains
"2+2", as we could see when we display
x as a string. In the second case 2+2
was evaluated to be 4 and then stored in x.
Here's a test: What will the following code display (note that ^2 means raise
to the second power)?
local x -2
di `x'^2
If you guessed 4, you forgot either the precedence of algebraic operators or
how Stata uses macros. `x' is replaced by -2
before Stata does anything with it, so it sees -2^2.
But the power takes precedence over the minus sign, so this is the same as
-(2^2), not (-2)^2. If `x' were a variable
like in other programming languages, the minus sign would not be separate
from the 2.
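If you do want the macro's contents treated as a single value, one common remedy is to wrap the macro in parentheses. The short sketch below reuses the x macro from the test above; the comment lines are only there for illustration.

```stata
* x holds the text -2; it is pasted into the command before Stata evaluates it
local x -2

* Stata sees -2^2, which is -(2^2), so this displays -4
display `x'^2

* Stata sees (-2)^2, so this displays 4
display (`x')^2
```

The parentheses are just more text as far as the macro processor is concerned, but they force the minus sign to stay attached to the 2.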
The nice thing about macros not being variables is that you can put almost
anything in them and use them absolutely anywhere. You can even include macros
in macro definitions. Try:
local i=`i'+1
Right now, `i' has nothing in it. So what just
happened? An undefined macro will be replaced by nothing. So what Stata saw
was
local i=+1
which is perfectly legal. In most cases however, using an undefined macro will
lead to syntax errors. If you mistype a macro name, Stata will assume you
meant some other, currently undefined macro--this can be a particularly difficult
error to debug.
Run that command again:
local i=`i'+1
This time `i' does have some content, so Stata
sees
local i=1+1
and `i' is set to 2. You could use a command
like this to increment a counter.
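As a small sketch of the counter idea, the lines below start a counter at zero and increment it twice. The second increment uses local ++i, which is Stata's shorthand for the same operation; the macro name i is simply the one used above.

```stata
* Start the counter at zero
local i 0

* Increment by rewriting the macro in terms of itself
local i = `i' + 1

* Shorthand for the same thing
local ++i

* Displays 2
display `i'
```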
Macros are perfectly legal in file names for log files and data sets. For example
if you were creating separate data sets by race and sex, you could just define
macros for race and sex and then use them in the save command. If you type
local race Black
local sex Women
save `race'`sex'
then Stata creates a data set called BlackWomen.dta.
If the save command followed
local race White
local sex Men
it would create WhiteMen.dta.
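Here is a sketch of the same idea that you can actually run, since it uses the auto data set that ships with Stata rather than data on race and sex. The macro names origin and size, the weight cutoff, and the resulting file name are all made up for this illustration. Note that it writes ForeignHeavy.dta to your current working directory.

```stata
* Load the example data set that comes with Stata
sysuse auto, clear

* Hypothetical macros describing the subsample
local origin Foreign
local size Heavy

* Keep only the matching cars
keep if foreign == 1 & weight > 2500

* The macros are expanded first, so this saves ForeignHeavy.dta
save "`origin'`size'", replace
```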
You could also use macros as a replacement for copy and paste in your text
editor: assign a macro to a short piece of code that is repeated in your program
and just type the macro instead of the code. But this generally makes
your code much harder to read. For example:
tab `macro1'
could be tabulating anything, and could include if
or in conditions and options. In order to know
what's going on, you have to find the most recent definition of macro1.
On the other hand, if used wisely macros can make your code clearer. The key
is to use well-named macros to substitute for logical chunks of code. For
example, if you had a big list of control variables that you used constantly,
you could define the list as a macro called controls.
Then instead of
reg income edu race occupation location... (many
more control variables)
you could type
reg income edu `controls'
and be done. Or if you repeatedly deal with subsamples of your data, you could
define a macro that gave the conditions for that subsample. For example
reg income edu `controls' if race=="black" & sex=="female"
could also be done as
local blackWomen race=="black" & sex=="female"
reg income edu `controls' if `blackWomen'
You could save a bit of typing by including the if within the macro, since
the macro only makes sense when following an if. But leaving the if out
keeps your code readable. If you include the if, it's not clear what the
macro `blackWomen' does. But if `blackWomen' makes it fairly obvious that
the macro just gives the particular conditions that define a black woman in
your data set. Note that macro names are case-sensitive, so the name must be
typed exactly as it was defined. In both of these cases, the macros (perhaps
arguably) make your code clearer by hiding the details of the implementation.
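The income data used above is hypothetical, but the same pattern can be tried on the auto data set. In the sketch below, the choice of control variables and the macro name cheapCars are assumptions made purely for illustration.

```stata
* Load the example data set that comes with Stata
sysuse auto, clear

* A list of control variables used repeatedly
local controls weight length foreign

* A condition defining a subsample (the if is deliberately left out)
local cheapCars price < 6000

* Reads almost like a sentence once the macros are well named
regress mpg headroom `controls' if `cheapCars'
```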
To see the above examples in action, run macro.do.
Programs
A program allows you to define a chunk of code in one place and run it repeatedly.
You can also pass in parameters which will be stored as macros, then use those
macros in various ways within the program.
A Stata program is just some Stata code with the line program
define name at the beginning and the
line end at the end. It is a tradition that
the first program you write in a new language simply display the message "Hello
World" (who starts these traditions?) so let's do that. Type:
program define hello
1. di "Hello World"
2. end
Note that Stata provides the line numbers for you; you will not put them in
when you write a program in a do file (see the example code at the end). To
run your program, type hello. Okay, now you've
paid your dues to tradition. More importantly you now understand the mechanics
of writing a program. Now any time you want to say "Hello World"
all you have to do is type hello. What a time-saver!
But even supposing you had a reason to say "Hello World" once, to
say it more than once in exactly the same way seems a bit redundant. You need
to add some flexibility to your program, and that's where the macros come
in.
When running a program, anything typed after the program name will be interpreted
as arguments. Arguments for programs work much like mathematical functions:
the program does whatever it does depending on its arguments. Within the program,
arguments are referenced by number: `1' is
the first thing after the program name, `2'
the second, etc. (spaces define where one argument ends and the next begins).
Change your program a bit:
program define hello
1. di "Hello `1'"
2. end
But you never got that far, did you? Instead you got an ugly message saying
that hello is already defined. You can't have two hello programs at once. You
need to get rid of the original by typing:
program drop hello
This can be a minor nuisance if you're running Windows Stata or
GUI Stata on Linux, where you can't really be sure what has been
going on before your do file is executed. If you try to define a
program that's already been defined, your .do
file will crash with the message you just saw. If you try to drop
a program that hasn't been defined (for example if you tried program
drop hello twice) your do file will crash again, with a message saying
that hello was not found.
The solution lies in the capture command. When
a command is preceded by capture, any errors
it generates are ignored (they are captured by capture).
So capture program drop hello will get rid
of a program called hello if it exists, and do nothing if it doesn't.
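In a do file, this usually means starting every program definition with a capture program drop line. The following is a minimal sketch of that pattern using the original hello program; note that there are no line numbers, since Stata only supplies those when you type a program interactively.

```stata
* Drop any earlier definition; do nothing if there isn't one
capture program drop hello

* Define the program
program define hello
    display "Hello World"
end

* Run it
hello
```

Written this way, the do file can be run as many times as you like without crashing on either error.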
Now back to your modified hello program. You see that it now calls upon the
local macro `1'. Try running it by typing hello
and see what it does. Since we haven't defined `1',
it is ignored. More precisely, the `1' is replaced
by its value: absolutely nothing. Now try typing
hello Russell
The program responds Hello Russell. But typing
hello Russell Dimond
just displays Hello Russell again, because Dimond
is stored in `2', and our program doesn't do
anything with `2'. On the other hand, the macro
`0' contains all the arguments, which gives
us some additional flexibility. Let's change our hello
program one more time:
program drop hello
program define hello
1. di "Hello `0'"
2. end
Now try:
hello
hello Russell
hello Russell Dimond, how are you today?
If you are an excellent typist you may have missed an important lesson, but
the rest of us got it: never try to input a program interactively. One mistake
and you have to drop the program and start all over again. Always define programs
in .do files.
Finally, if you have ever used a general-purpose programming language (FORTRAN,
C/C++, Java, whatever) or if you work with someone who has, be prepared for
a bit of confusion about nomenclature. The logical equivalent of a Stata .do
file in these languages is a program, while the logical equivalent of a Stata
program is (depending on the language) a subroutine, function, procedure,
or method (at any rate just a part of a program). You'll probably hear people
refer to do files as programs all the time (I do it), so don't be confused
if someone starts calling a Stata program a subroutine.
The file program.do will run all these
examples.
Loops
Most Stata commands are really loops. Stata carries out the command for the
first observation, then the second, and so forth. Take
advantage of this looping structure whenever you can, because it is quite
fast. But it's not hard to imagine other loops you might want: for example,
you might want to execute the same command for five different variables.
Stata allows you to do these too--you just have to write them yourself.
foreach
The foreach command allows you to create loops
that loop over a list of things.
foreach macroname in/of [list type] list {
code involving `macroname'...
}
macroname is
a name we choose to represent the elements in our list. As always, make
the name informative. The in/of construct means
you will use either in or of
(not both), depending on the type of list. You'll use in
for generic lists, of for all others. The list
type is optional. If it is omitted Stata will interpret what
follows as a generic list. Finally there will be the list to be acted on,
and then a left curly bracket. Note the placement of the brackets: the
first one must be part of the same statement as the foreach
(it cannot go on the next line like in C/C++ or Java) and the last one must
be its own statement (it cannot go at the end of the last command inside
the loop). Everything inside the curly brackets will be executed once for
each item in the list, and macroname
is a local macro that will contain each item in the list in
turn. Let's look at an example:
foreach color in red blue green {
1. di "`color'"
2. }
will give the output
red
blue
green
Note that, as with programs, Stata gives you line numbers when
you type a foreach loop interactively but you will not need
to type them in do files. Using in with
no list type
indicates a generic list. Stata makes no attempt to interpret what
follows other than break it up into elements. Thus our loop runs
three times, once for each element. The first time, `color'
is set to the word red, the second
time to blue, and the third time
to green.
It's very common for your list of items to be stored in a macro
that was constructed earlier in the program. You can use such
a macro directly in a foreach command:
local colors red blue green
foreach color in `colors' {
1. di "`color'"
2. }
However, this is so common that Stata wrote special code to handle
this case more efficiently.
local colors red blue green
foreach color of local colors {
1. di "`color'"
2. }
Note that in changed to of because
local is officially a list
type, if a rather odd one. Also note that colors is
not in quotes in the foreach command.
If it were in quotes, the standard macro processor would expand
it out to red blue green. Instead,
we let the local list
type look up what the macro means, which it does very
quickly.
Normally list types tell Stata what types of things are in
your list. The available types are varlist, newlist,
and numlist.
For the next example, we'll use the auto data set that comes installed
with Stata. Load it by typing
sysuse auto
(The sysuse command loads a file
from whatever directory Stata is installed in--it's only useful
for examples.)
The varlist construction specifies that what
follows is an official list of variables. That's not quite as important
as it sounds, because you can also put variable names in generic lists. But
compare the following:
foreach var in price mpg rep78 {
1. di "`var'"
2. sum `var'
3. }
foreach var of varlist price-rep78 {
1. di "`var'"
2. sum `var'
3. }
foreach var in price-rep78 {
1. di "`var'"
2. sum `var'
3. }
In the first case, foreach interpreted the list as three words, each of
which the sum command later recognized as a variable name. In the second
case, foreach was forewarned to expect a variable list, and thus interpreted
price-rep78 as a list of three variables. However, in the third case foreach
had no such warning and interpreted price-rep78 as a single word. As a result
the loop was actually executed just once. It was the sum command that later
interpreted price-rep78 as a variable list containing three variables.
newlist is for lists of new variables; variables
which do not yet exist but will be created inside the loop. For example:
foreach var of newlist x1 x2 x3 x4 x5 {
1. gen `var'=0
2. }
newlist checks to make sure the list only contains valid variable names, but
does not actually create the variables--gen does that.
numlist is for lists of numbers. Compare this
with the previous:
foreach i of numlist 1/5 {
1. gen y`i'=0
2. }
Note how the `i' macro acts like a subscript
to the y variable. This is a very common construction:
population`year', income`wave',
etc.
forvalues
On the other hand, looping over a list of evenly-spaced numbers is the specialty
of forvalues, and it will do it faster than foreach.
Also, since foreach has to construct the whole
list of numbers before it can start, it can only handle relatively small
lists. forvalues has no such limit. It's quicker to type too:
forvalues i=1/5 {
1. gen z`i'=0
2. }
forvalues isn't limited to counting upwards by one--type help
forvalues for
details on other constructions.
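As a brief sketch of two of those other constructions, the first loop below counts upwards in steps of five and the second counts downwards by one. The particular ranges are arbitrary.

```stata
* first(step)last: displays 0, 5, 10, 15, 20
forvalues i = 0(5)20 {
    display `i'
}

* A negative step counts down: displays 3, 2, 1
forvalues i = 3(-1)1 {
    display `i'
}
```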
Use loop.do to run all these examples.
Stored Results
This section may not be a programming topic, but it is a tool we'll use in
our final example. And it's good to know anyway.
Many Stata commands store values in an internal array you can access
once you know it's there. Estimation commands create an array called
e( ), and you can see what's in it
by typing ereturn list. Almost all
other commands that return results put them in an array called r(
), and you can see what's in r( )
by typing return list. The manuals
also describe what each command returns. The only trick is that
every command that uses the e( )
or r( ) arrays overwrites the previous
contents. So if you want to do anything with the results of a command,
you must do it before you issue another command that returns values.
One option is to save the results in variables or local macros
for later use. Try the following:
reg price weight foreign mpg
ereturn list
sum weight
return list
If you want to demean weight (subtract the mean from all observations), all
you have to do is type
replace weight = weight - r(mean)
Try it and then do
sum weight
again to see the results. Note that there are issues with numerical precision,
but you've accomplished your purpose. Keep in mind that you have also replaced
the old values of the r( ) array with a new
set of values referring to the second time you ran summarize.
Good thing you were done with the old results.
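If you are not done with the old results, copy them into a local macro before running the next command. The sketch below assumes a freshly loaded auto data set; the names weightMean and weightCentered are made up for the example.

```stata
* Load the example data set that comes with Stata
sysuse auto, clear

* Store the mean of weight before r() is overwritten
summarize weight
local weightMean = r(mean)

* This replaces the contents of r() with results for price
summarize price

* The stored mean is still available in the macro
generate weightCentered = weight - `weightMean'
summarize weightCentered
```

The mean reported by the last command should be very close to zero, though probably not exactly zero because of the numerical precision issues mentioned above.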
To see this in practice, take a look at results.do.
A Program to Demean Data
Let's put together everything you've learned by writing a program that demeans
data. This is a simple enough task that a program isn't really needed, but
we'll go a step further and make it both flexible and error-resistant. In
other words, we'll put a lot more effort into it than it's worth (except as
a learning experience, of course).
We'll start with the simplest possible version (which is generally a good idea
when programming). It will take one argument, a variable name, and demean
that variable.
program define demean
1. sum `1',meanonly
2. replace `1'=`1'-r(mean)
3. end
Try it out and see how it does (just reload the auto data set if you start
running out of variables with non-zero means).
That's fine as far as it goes. But suppose you wanted to demean 20 different
variables? It's time to add a foreach loop.
Recall that local macro `0' (zero) contains
all the arguments passed in to the program. We could have our foreach loop
work with this as a variable list, or even a generic list. But local
was created for exactly this kind of situation and will run a bit faster.
So the next version is:
program drop demean
program define demean
1. foreach var of local 0 {
2. sum `var',meanonly
3. replace `var'=`var'-r(mean)
4. }
5. end
There's just one problem with your demean program. To see it type demean
make. The make variable is
a string. It has no mean, and so your program crashes. Now, you
may be thinking that anyone who tries to demean a string deserves
what's coming to them, but let's fix it anyway, just so you can
learn how. You may not be able to demean a string, but you can give
a better error message, and then proceed to demean any other variables
that were requested and are valid.
If as a Way to Control Program Flow
You're used to using if at the end of commands.
That meant "execute the preceding command for a given observation only
if this condition is true for that observation." What you're going to
do now is very different. You're going to say "only execute the following
commands for ANY observation if this condition is true." The condition
itself is also different: it is a scalar. It is evaluated just once, not once
for each observation. If the condition includes a variable, the value of that
variable for the first observation will be used. It is also possible to combine
if with else, so you can make arbitrarily complex sets of conditions. The
syntax looks like this (this is a fairly complex example so you can see how
all the pieces work--we'll do something simpler in our program):
if condition1 {
commands to execute if condition1 is true...
}
else if condition2 {
commands to execute if condition1 is false and condition2 is true...
}
else {
commands to execute if both condition1 and condition2 are false...
}
Note how the brackets have to be placed just like with foreach.
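To make the template concrete, here is a small sketch that branches on the number of observations (_N) in the auto data set. The cutoffs and messages are arbitrary choices for illustration.

```stata
* Load the example data set that comes with Stata (74 observations)
sysuse auto, clear

* The condition is evaluated once, not once per observation
if _N > 100 {
    display "large data set"
}
else if _N > 50 {
    display "medium data set"
}
else {
    display "small data set"
}
```

With the auto data this displays medium data set, because the first condition is false and the second is true.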
The problem with your program is that as soon as Stata sees you try to subtract
something from a string variable, it crashes with a type mismatch error
before it even looks at any observations. So your job is to detect strings
before you try to demean them, and only subtract things that can be subtracted.
You can do this using the confirm command.
It's a bit like assert in that you use it to
check on things you believe to be true, but it's designed for programmers.
Thus it allows you to check, for example, that a file actually exists or, in
this case, that a variable is numeric and thus has a mean. The syntax is
confirm numeric variable var
where var is
the variable you're checking. It will do nothing if the variable is numeric,
and cause an error if it is not. But you don't want it to crash the program,
so put capture in front of it.
But how will you know the result if you use capture?
The capture command stores the outcome in a system variable called _rc,
which is short for return code. A return code of zero
means the command was successful. Any other value means something
went wrong (different errors give different return codes). So all
you have to do is check the value of _rc
with an if statement. If _rc is zero,
you know the variable is numeric and you can demean it. If not,
you give an error message but the program continues to run and processes
the rest of the variables.
program drop demean
program define demean
1. foreach var of local 0 {
2. capture confirm numeric variable `var'
3. if _rc==0 {
4. sum `var',meanonly
5. replace `var'=`var'-r(mean)
6. }
7. else di "`var' is not a numeric variable and cannot be demeaned."
8. }
9. end
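If you'd like to see the return codes for yourself before trusting them inside a program, a quick check along the following lines may help. It assumes the auto data set, where price is numeric and make is a string.

```stata
* Load the example data set that comes with Stata
sysuse auto, clear

* price is numeric, so confirm succeeds and _rc is 0
capture confirm numeric variable price
display _rc

* make is a string, so confirm fails and _rc is nonzero
capture confirm numeric variable make
display _rc
```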
The file demean.do contains and demonstrates
all the various versions of the demean program. You'll also notice some comments
and a great deal of indenting to make the logical structure easy to see. Both
practices are highly recommended.
ado (Automatic Do) files
You now have a nice little program that could be useful in a variety of settings.
But you have to run the code that defines it before you can use it. What if
you could make it act like any other Stata command and run as soon as you
type it? You can, by making it an ado (automatic do) file.
An ado file is just like a do file that defines a program, but
the filename ends with .ado and it is stored in one of several ado
directories. When you type a command, Stata checks the ado directories
to see if there is an ado file with that name. If there is, Stata
automatically runs the ado file that defines the program and then
executes it. Thus from the user's perspective, using an ado file
is just like using a built-in Stata command. In fact many Stata
commands are actually implemented as ado files.
In order to create an ado file, you need to isolate the demean program in a separate
file and save it as demean.ado in your personal
ado directory. You can identify your personal ado directory by typing sysdir.
On the SSCC's Linux servers, it is ~/ado/personal
(recall that ~ means your home directory). On
the Winstats it is w:\ado\personal.
Once that's done, demean.ado will almost be like an official Stata command.
Not quite though: note that we made no provision for standard Stata syntax like
by: or if. Doing
so isn't actually as hard as you might think, but still beyond the scope of
this publication.
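For reference, here is a sketch of what demean.ado might contain, based on the final version of the program above. The indentation and comment are additions. No program drop line is included, on the assumption that Stata only loads the ado file when demean is not already defined.

```stata
* demean.ado: subtract the mean from each numeric variable listed
program define demean
    foreach var of local 0 {
        capture confirm numeric variable `var'
        if _rc == 0 {
            summarize `var', meanonly
            replace `var' = `var' - r(mean)
        }
        else display "`var' is not a numeric variable and cannot be demeaned."
    }
end
```

With this file saved in your personal ado directory, you should be able to load the auto data set in a fresh Stata session and type demean price make mpg without defining anything first.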
You've now learned a powerful set of tools that can save you a great deal of
time and trouble. At first you may need to consciously look for opportunities
to use them. But they will soon become second nature, and writing code without
them will seem unbearably tedious. Consider that progress.