|
Once you're familiar with the basics of how Stata works (see An
Introduction to Stata if you're not), this publication will allow you
to move beyond the obvious. Stata's syntax is logical and structured, but
the tricks you can play with it may surprise you. With a bit of experience,
you'll very easily have Stata doing things other statistical programs would
find extremely difficult. In this publication we'll spend a lot of our time
working by example, meaning we'll make up a problem and then find a clever
way to solve it. You won't have much trouble finding ways to apply the same
principles in your work.
We'll use a fair number of very small files, all of which are available for
download. I suggest you get them all before beginning so you won't be interrupted
as you work, though there are links to them throughout this document. I have
created a separate page listing all the files
where you can download them. If you are using an SSCC Linux server, the following
Linux commands will create a new directory called intstata
and copy all the files into it:
mkdir intstata
cd intstata
cp /usr/global/web/sscc/pubs/files/4-10/* .
General Stata Syntax
Let's start with some general ideas to make sure we have a common vocabulary.
The syntax for a Stata command is:
[by varlist:] command [varlist] [=expression] [if expression] [in range] [weights] [, options]
Brackets denote optional elements, though some commands require some elements
or cannot use others. Note that a varlist can
have any number of variables, though some commands have specific requirements.
Stata is very consistent in its use of this syntax, and it can be helpful
to understand where all the various parts of the commands you give fit into
this scheme.
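To see how the pieces of this syntax fit together, here is a small sketch. It assumes the auto data set that ships with Stata (loaded with sysuse), which is not one of this publication's files, and pricePerLb is a made-up variable name used only for illustration.

```stata
* Load an example data set that ships with Stata (an assumption for illustration)
sysuse auto, clear

* [by varlist:]   -> bysort foreign:  (bysort sorts first, then works by group)
* command         -> summarize
* [varlist]       -> price weight
* [if expression] -> if mpg > 20
* [, options]     -> , detail
bysort foreign: summarize price weight if mpg > 20, detail

* [=expression]   -> = price / weight, evaluated once for each observation
generate pricePerLb = price / weight
```

The second command also illustrates the implicit loop: the expression is computed separately for every observation in the data set.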
The second general idea to keep in mind is that Stata executes each command
one observation at a time. Essentially each command includes an implicit loop
that says "execute this command for each observation."
Good Practices
Next we'll go through some general practices that will not only make you a
more efficient programmer, but will help you avoid embarrassing errors.
Do Files
The first rule is to always write do files. Interactive mode is good for learning,
for error checking, and for exploring your data, but anything you will take
seriously needs to be reproducible. That means putting the commands into a
do file. If everything is written in a do file, you know exactly what you
did and can repeat it on demand. In one case I saw recently, six months into
a major project the researchers noticed that some frequencies were extremely
implausible--clearly the result of an error. But they had no idea when the
error had occurred. Had they done all their work in do files, they could have
repeated the analysis, one step at a time, checking the frequencies after
each one until they found the mistake. Then they could fix the error and rerun
all their do files without having to duplicate any of their previous work
(the computer would have been duplicating previous work, but within reason
we don't care how much work it does).
A do file is an ASCII text file whose name ends with .do and contains Stata
commands. Since do files are ASCII text you can use any text editor you like.
I recommend for Linux or
for Windows (you can associate the do extension with
in Windows so do files open automatically). You can use
even if you are using Linux to run Stata--take a look at Running Linux Programs Using Windows (Mostly).
In Linux, you will normally run your do file in batch mode. Just type
> stata -b do dofile &
where dofile
is the name of the do file you want to run. Stata will run quietly in the
background and exit when it is done. For big jobs, simply replace stata
with condor_stata and your job will be submitted
to our Condor pool for faster processing. Note that the current working directory
for the Stata job will be the current working directory of your shell when
you started it.
It is possible to do the same thing in Windows using the command prompt, but
this is awkward. Normally you will start a Stata session and then run your do
file from within it (this works in Linux Stata too). Just type
do dofile
in the Stata command window. The difference is that your do file will be affected
by the current state of your Stata session, and may change that state as it
runs. For example, there may already be data in memory, or your do file may
leave its log file open. So if you will be running do files this way, you
probably want to start them all with:
clear
capture log close
set more off
The clear command removes any data currently
in memory so your program starts with a clean slate. log
close closes the current log, but this will generate an error (and
crash your do file) if there is no log open. Thus we precede it with capture.
The capture command tells Stata to ignore any
errors generated by the next command--we'll do more with it in Programming
in Stata. With capture, we're essentially
saying "close the log if it's open, but if it's not don't worry about
it." set more off tells Stata not to wait
for you to hit a key when the screen fills up.
Logs
If you are running in batch mode, the only way to view your output is to save
it in a log and then read the log file after the do file has run. But even
if you are running your do file from inside a session, you'll want to save
the output for the future. So usually the first thing your do file will do
(after any setting up as described above) is to open a log. The command is
just
log using logfile.log, replace
where logfile
should be the name of the file where you want to save the log. Normally this
should be very similar to the name of the do file, so you know which files
go together. The replace
option tells Stata to go ahead and overwrite a previous log with that name.
The default is for Stata to refuse to overwrite existing logs. But this means
if you run a do file, find an error, and want to run it again, you must manually
delete the old log first.
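Putting the setup commands and the log together, a do file might be laid out roughly as follows. This is only a sketch: the file names example.do and example.log are hypothetical, and the two analysis lines are placeholders that use the auto data set shipped with Stata.

```stata
* example.do: a hypothetical do file skeleton
clear                           // start with no data in memory
capture log close               // close a log if one is open; ignore the error if not
set more off                    // do not pause when the screen fills
log using example.log, replace  // log named after the do file; overwrite any old copy

* The real work goes here. These two lines are placeholders.
sysuse auto
summarize price mpg

log close                       // close the log so the file on disk is complete
```

Ending with log close means the next do file you run in the same session does not find a log still open.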
Comments
My current position has made me a big fan of comments. Anything between /*
and */ will be ignored by Stata. This allows
you to insert explanations of what you are doing and why for humans to read.
If anyone else ever needs to read and understand your code, they will be eternally
grateful for comments. But the most likely beneficiary is yourself. In six
months (let alone ten years) your code might as well have been written by
someone else. Commenting is an investment with a very high rate of return.
The examples all include comments.
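As a brief illustration, the sketch below shows the /* */ style described above along with two other comment styles Stata accepts in do files: a line that begins with an asterisk, and text that follows a double slash. The commands themselves are placeholders using the auto data set shipped with Stata.

```stata
/* A block comment can span
   several lines. */

* A line that starts with an asterisk is also a comment.
sysuse auto, clear     // text after a double slash is ignored in do files
summarize price mpg    /* a block comment can also sit at the end of a line */
```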
Variable Names
I am amazed and dismayed by some of the gibberish that is used for variable
names. There's simply no need to live with variables like H2V06 or worse.
Renaming variables to something that is meaningful takes a bit of time, but
will save a great deal of time and confusion down the road. Keep in mind that
Stata can now use variable names that are up to 32 characters long, and that
any variable name can be abbreviated. The command syntax is just ren
oldName newName.
Variable names must be one word with no spaces. However capitalization can
make it more readable. For comparison try to interpret numinhh
vs. numInHH (number in household). Another
alternative is to use the underscore (_) as a space: num_in_hh.
Personally I get tired of reaching for the underscore key and holding down
SHIFT, but it's a matter of taste.
The same applies to the length of variable names. numberInHousehold
is very clear, but it's fairly long. numInHH
is much shorter, but you pretty much have to know what it is to read it. In
general the more often you use a variable the shorter the name you want to
give it, both to save typing and because the meaning will be familiar anyway.
In addition, Stata will allow you to use any unique abbreviation of your variable
names (though this may make your code harder to read). So using numberInHousehold
as a variable name may not be so bad if it's the only variable that starts
with num, because you can just type num
in commands and Stata will know you're referring to numberInHousehold.
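A minimal sketch of renaming and abbreviating, assuming a tiny made-up data set in which H2V06 happens to hold the number of people in the household (the values are placeholders):

```stata
* Build a three-observation placeholder data set
clear
set obs 3
generate H2V06 = _n

* Give the variable a meaningful name and a label
rename H2V06 numberInHousehold
label variable numberInHousehold "Number of people in household"

* num is a unique abbreviation here, so Stata expands it
summarize num
```

If a second variable starting with num were added later, the abbreviation would become ambiguous and the last command would fail, which is one reason to spell names out in do files you plan to keep.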
Length of do files
Suppose you have a major project that will take several hours of computer time
to complete. The last thing you want is to discover that you mistyped the
very last command and have to start all over again. It is far better to break
up the project into smaller steps and write a separate do file for each. Not
only will this avoid wasting time rerunning good code to fix the bad code
that comes after it, but it also makes it easier to test your results after
each step, avoiding subtle errors. And because do files can call other do
files, it is a simple matter to write a single master do file that runs all
the steps you have successfully completed.
Recycling Code
In programming circles there's a huge emphasis on writing reusable code. The
best tool for doing it--object oriented programming--is not available in Stata.
But you can do what's called modular programming. Just take a moment when
you start a project and see if there are any steps that might also apply to
future work. If so, isolate them in separate do files and then as much as
possible write the code in such a way that it doesn't depend on the particular
features of the data set you are working with now. With luck, you'll be able
to use that do file in the future and save yourself a lot of time.
Now on to the examples...
Using Reshape
I wrote this section because once for about a month it seemed like the answer
to every question I got was reshape. reshape
is useful for data sets where the observations fall naturally into groups
and subgroups. We'll do two examples. In the first a "group" is
a country and a subgroup is that country's data for a particular year. In the
second the group is a household and the subgroup an individual in that household.
In the long form, each subgroup is an observation; in the wide form each group
is an observation and each subgroup has its own variable(s). Consider the
following miniature data sets (reshape1.dta
is the first one) from the Penn World
Tables. They both contain the exact same information, the first in long
form, the second in wide:
| country | year | pop |
| ALGERIA | 1980 | 18669 |
| ALGERIA | 1981 | 19254 |
| ALGERIA | 1982 | 19862 |
| ALGERIA | 1983 | 20495 |
| ALGERIA | 1984 | 21173 |
| ALGERIA | 1985 | 21848 |
| ALGERIA | 1986 | 22497 |
| ALGERIA | 1987 | 23124 |
| ALGERIA | 1988 | 23758 |
| ALGERIA | 1989 | 24374 |
| ALGERIA | 1990 | 25003 |
| ANGOLA | 1980 | 7581 |
| ANGOLA | 1981 | 7783 |
| ANGOLA | 1982 | 7990 |
| ANGOLA | 1983 | 8202 |
| ANGOLA | 1984 | 8400 |
| ANGOLA | 1985 | 8605 |
| ANGOLA | 1986 | 8841 |
| ANGOLA | 1987 | 9084 |
| ANGOLA | 1988 | 9334 |
| ANGOLA | 1989 | 9590 |
| ANGOLA | 1990 | |
| country | pop1980 | pop1981 | pop1982 | pop1983 | pop1984 | pop1985 | pop1986 | pop1987 | pop1988 | pop1989 | pop1990 |
| ALGERIA | 18669 | 19254 | 19862 | 20495 | 21173 | 21848 | 22497 | 23124 | 23758 | 24374 | 25003 |
| ANGOLA | 7581 | 7783 | 7990 | 8202 | 8400 | 8605 | 8841 | 9084 | 9334 | 9590 | |
To go from the first format to the second, just type:
reshape wide pop, i(country) j(year)
To go from the second to the first, type:
reshape long pop, i(country) j(year)
i(variable) identifies what constitutes a group,
in this case, a country. j(variable) tells
us what identifies a subgroup, in this case a year for a given country. Note
that year does not appear as a variable in
the wide form; instead its values have been appended to the variable name
pop. If you had more variables that apply to
the subgroups, they would be added to the varlist
right after pop.
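As a sketch of the multi-variable case, the example below types in the first two years of the population data shown above and adds a second stub called dummy. The dummy variables and their values (1 through 4) are placeholders invented for illustration, not part of the Penn World Tables extract.

```stata
clear
input str8 country pop1980 pop1981 dummy1980 dummy1981
"ALGERIA" 18669 19254 1 2
"ANGOLA"   7581  7783 3 4
end

* Both stubs go in the varlist; year is built from the numeric suffixes
reshape long pop dummy, i(country) j(year)
list, sepby(country)

* And back again: one observation per country
reshape wide pop dummy, i(country) j(year)
list
```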
Using Reshape to Separate Individuals from Household Records
Another common application for reshape is to
create separate records for individuals from a data set where each record
is a household. For example consider the following:
| hh_id | income | age1 | sex1 | age2 | sex2 | age3 | sex3 |
| 1 | 30000 | 30 | F | 2 | F | | |
| 2 | 90000 | 45 | M | 43 | F | 15 | M |
In this data set (reshape2.dta) an observation
is one household, but it contains information on the individuals in that household.
If we want to do analysis based on the individuals, we'll need to separate
them out. Simply type:
reshape long age sex, i(hh_id) j(orderInHH)
And the result will be:
| hh_id | orderInHH | income | age | sex |
| 1 | 1 | 30000 | 30 | F |
| 1 | 2 | 30000 | 2 | F |
| 1 | 3 | 30000 | | |
| 2 | 1 | 90000 | 45 | M |
| 2 | 2 | 90000 | 43 | F |
| 2 | 3 | 90000 | 15 | M |
Note that orderInHH was generated because we
told Stata that's what the numbers at the end of the individual variables
meant. We could have omitted j(variable) from
the command entirely, in which case Stata would have made a variable called
_j. The orderInHH
variable may not contain useful information, in which case it can be dropped
immediately, but Stata will insist on creating a variable for j when you use
reshape.
Notice that the variables in the reshape varlist
are interpreted as subgroup (in this case, individual) variables, and anything
that is not listed is assumed to be a group (in this case, household) variable.
The problem is person 3 in household 1. She doesn't really exist (household
1 only has two members), but Stata just thinks she has missing values for
age and sex.
What's worse, real data sets need as many sets of individual variables as there
are people in the largest household in the survey, so there are usually a lot of people who don't
exist. The problem is telling apart people who don't exist from people who are just missing
some variables. My rule of thumb is that if a person is missing all the individual
variables, they don't exist. The following code will identify and drop them
in this data set. If you have a lot more variables in your data set this command
will be a lot longer.
drop if sex=="" & age==.
Note that the code for a missing string is "" (an empty string)
rather than a period (.). Now we have a data set of individuals,
but each one has all the household level variables as well.
All these commands can be found in reshape.do.
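For readers without the files at hand, here is a self-contained sketch of this example. It types in the household table shown above rather than loading reshape2.dta, and it uses the missing() function, which accepts both numeric and string variables, as one possible alternative way to write the drop condition.

```stata
clear
input hh_id income age1 str1 sex1 age2 str1 sex2 age3 str1 sex3
1 30000 30 "F"  2 "F"  . ""
2 90000 45 "M" 43 "F" 15 "M"
end

* One observation per (potential) household member
reshape long age sex, i(hh_id) j(orderInHH)

* Drop people who are missing every individual-level variable
drop if missing(age) & missing(sex)
list, sepby(hh_id)
```

With many individual-level variables the condition still grows by one missing() term per variable, so the rule of thumb is the same; only the spelling changes.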
Working with Groups
Data with individuals and households is very common and Stata handles it quite
well (if you want a programming challenge, try doing this section in SAS).
The key is the by: group. by:
groups are processed separately for each unique combination of the by:
variables, and this includes having their own unique values of _n
(observation number) and _N (number of observations).
This allows for some nifty programming tricks.
Continuing to use the households you worked with in the last section, start
by finding the size of each household:
bysort hh_id: gen size=_N
by: requires that the data be sorted by the by:
variables, but bysort takes care of this automatically,
saving you a step. You then set size equal
to the number of observations in each household (_N).
Next find the number of adults in each household. Start by creating an indicator
variable for adult status:
gen adult=(age>17)
Note how you have set adult equal to a condition. Stata has no Boolean type,
so true/false conditions are stored as 1 for true and 0 for false. This makes
it very easy to create indicator variables. Note that you can use it the other
way too: if adult will be false if adult is
zero and true if it has any other value. The next step is to simply add up
the number of adults in each household.
by hh_id: egen adults=sum(adult)
Having one variable called adult and another
called adults is probably a bad idea, but you
only created adult as a step towards calculating
adults, so let's drop it:
drop adult
The following code will create an indicator variable for whether the household
contains a member that is male:
gen male=(sex=="M")
by hh_id: egen hasMale=max(male)
drop male
The part you haven't seen before is the second line. The max
function does exactly what you think it does: return the maximum value of
male. But since you're working by
hh_id: it returns the maximum value of male
in that household. If there is no male in the household, then male
is always zero and max(male) is zero. If there
is a male, then male is one for at least one observation, and max(male)
is one.
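The steps so far can be run end to end with the sketch below, which types in the five individuals from the previous section instead of loading a file. Two details differ from the commands above and are my own choices: the adult indicator adds age < . so that a missing age would not be counted as an adult (Stata treats missing as larger than any number), and the egen function is spelled total(), which is the name current Stata documentation uses for the sum described above.

```stata
clear
input hh_id orderInHH income age str1 sex
1 1 30000 30 "F"
1 2 30000  2 "F"
2 1 90000 45 "M"
2 2 90000 43 "F"
2 3 90000 15 "M"
end

bysort hh_id: gen size = _N             // household size

gen adult = (age > 17 & age < .)        // 1 if adult, 0 otherwise
by hh_id: egen adults = total(adult)    // number of adults in the household
drop adult

gen male = (sex == "M")
by hh_id: egen hasMale = max(male)      // 1 if any member is male
drop male

list hh_id orderInHH size adults hasMale, sepby(hh_id)
```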
Two more little tricks, mostly here as puzzles. The first is a way to check
if what we think uniquely identifies an observation actually does. In this
case the only id variable as such is hh_id.
But a combination of hh_id and orderInHH
should uniquely identify an individual. Let's see:
bysort hh_id orderInHH: assert _N==1
Recall that assert checks that a condition is
true and gives you an error if it is not. In this case you've broken your
data set into a separate group for each unique combination of hh_id
and orderInHH, so if there's more than one
observation in the group, they do not uniquely identify an observation.
Here's a quick check to see if there are any duplicate observations in your
data:
bysort *: assert _N==1
The * is shorthand for all the variables in
the data set. So in this case you're making a group for each unique combination
of all the variables. If there's more than one observation in one of these
groups, they are duplicates. Of course this is redundant now: since you know that
hh_id and orderInHH
uniquely identify an observation, there can't be any duplicates.
The code for this section is in hh.do.
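Stata also has built-in commands that perform similar checks, which you may prefer once the puzzle has served its purpose. The sketch below uses isid and duplicates report on a small typed-in data set; these are alternatives to the bysort tricks above, not what hh.do contains.

```stata
clear
input hh_id orderInHH age
1 1 30
1 2  2
2 1 45
end

* Errors out if hh_id and orderInHH do not uniquely identify observations
isid hh_id orderInHH

* Reports how many observations are exact copies across all variables
duplicates report
```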
Combining Data Sets
Combining data sets is very easy in Stata, though the logic behind it can (like
reshape) occasionally be headache-inducing.
Stata always works with one data set at a time, so you will always be combining
the data set in memory (the master data set) with another data set on disk
(the using data set, for reasons that will be clear momentarily).
Appending Data Sets

Stata calls it appending two data sets when you want to add the observations
from the using data set to the master. For example, if you had data on domestic
cars in one data set and foreign cars in another, you'd use append to combine
them. If the variables are not identical, then the resulting data set will
have all the variables used in either data set (the union). Missing values
will be assigned as needed. For example if your master data set only has a
variable called X and you append a data set that only has a variable called
Y, the resulting data set will have both X and Y. Observations from the master
data set will have missing values for Y, while observations from the using
data set will have missing values for X.
The syntax is simple: load the master data set into memory and then type
append using dataset
where dataset
is the name of the data set you want to append.
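The X and Y case described above can be tried with the following sketch. The values are placeholders, and usingData is a temporary file name created with tempfile so that nothing is left on disk afterwards.

```stata
* Using data set: only has Y
clear
set obs 3
gen Y = _n * 10
tempfile usingData           // temporary file, erased when the do file ends
save `usingData'

* Master data set: only has X
clear
set obs 2
gen X = _n

* Add the using observations below the master observations
append using `usingData'
list
```

The result has five observations: the two from the master have missing Y, and the three from the using data set have missing X.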
Merging Data Sets

In a merge, observations are combined. For example, if you had data on car
make and price in one file, and car mileage and repair record in another,
you'd want to use merge to combine them. The key to merge is that the data
will be combined based on one or more identifying variables you specify. If
the variables uniquely identify an observation in both data sets, then you
are performing a one-to-one merge. If the variables uniquely identify an observation
in one data set but not the other, then you are performing a many-to-one merge
(one example would be combining data about individuals with data about the
state they live in). If the variables do not identify a unique observation
in either data set, then you are probably making a mistake.
The syntax is similar to append, except that you add a varlist with the identifying
variables right after the word merge:
merge varlist using dataset
where varlist
is the list of identifying variables and dataset
is the using dataset (assuming you have already loaded the master data set).
Note that both datasets must be sorted by the varlist
before you perform the merge. Often this involves loading the using data set,
sorting it, saving it, and then loading the master data set and sorting it
before you're finally ready to do the merge itself.
Whenever you perform a merge, Stata will create a new variable called _merge.
_merge=1 means the observation came just from
the master data set. _merge=2 means it came
just from the using data set. _merge=3 means
it combined information from both (often the ideal). You should normally take
a look at the values of _merge (with tab
_merge for example), but you'll need to drop it before you can perform
any further merges.
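The merge syntax shown above is the older form. Assuming you are running Stata 11 or later, merge also accepts a form in which you state the type of merge (1:1, m:1, or 1:m) up front, and that form does not require you to sort either data set first. Here is a sketch of a many-to-one merge of individuals with state data; all names and values are made up for illustration.

```stata
* Using data set: one observation per state
clear
input str2 state stateVar
"WI" 1
"MN" 2
end
tempfile states
save `states'

* Master data set: one observation per individual
clear
input id str2 state
1 "WI"
2 "WI"
3 "MN"
4 "IA"
end

* m:1 = many individuals per state in the master, one row per state in using
merge m:1 state using `states'
tabulate _merge
```

Because you declare the merge type, Stata stops with an error if state turns out not to be unique in the using data set, which catches the "probably making a mistake" case automatically.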
Basic appends and merges are simple enough to do; the trick is often getting
the data into a form suitable to be merged. And it's sometimes surprising
what a merge can be used for, as you'll see in the following examples.
Reading Hierarchical Data
Data files come in all shapes and sizes, but here is one that's tricky enough
to make a good example and you may well find something similar someday. Suppose
you have a data set containing information on both household and individuals
in the following format:
Information on Household 1
Information on Individual 1 of Household 1
Information on Individual 2 of Household 1...
Information on Household 2
Information on Individual 1 of Household 2
Information on Individual 2 of Household 2...
This is often called hierarchical data. The tricky part is that the only thing
that tells us which household an individual belongs to is their location in
the file; the records do not contain a household ID (yet). hierarchical.txt
is an example data set in this format.
The first step is to write a data dictionary, or rather two data dictionaries,
one for each record type. I've provided these for you, hh.dct
and ind.dct. The important thing is that
each record has a type variable that tells
whether it is a household or a person (h for household, i for person). This
problem would be even more difficult without that; fortunately it is fairly
common.
You could read these data by using the data dictionaries with an if
condition to ensure we only read records that match the data dictionary we
are using:
infile using hh if type=="h"
infile using ind if type=="i"
If the records had a household ID, then it would just be a matter of merging
the two resulting data sets by household ID. Since they don't, you'll have
to work a little harder. The first step will be to read in only the household
data and generate a household number (we'll call it hh).
The household number will be just the observation number (_n),
once the data set is just households.
gen byte hh=_n
You'll also need to sort it by hh for future
merging and do some housecleaning. Here is the complete code for the first
step (hier1.do).
Now you have a data set of all the households, with each given a household
number. The trick is to give the individuals the same household numbers. Start
by reading in the entire file, households and individuals, using the individual
dictionary. This means for the households the only valid variable will be
type, but that's all you need.
Next, generate an indicator variable saying whether an observation is a household
or not (ishh). It is 1 for households and zero
for individuals.
gen byte ishh= (type=="h")
Now you will do a trick. The sum function used
with gen (which has nothing whatsoever to do with
the summarize command) will make a running
sum of a variable. And it turns out that
gen hh=sum(ishh)
gives you exactly the household number you need. For each observation, hh
will be the sum of ishh for all observations up to and including the current one.
For anyone in the first household, hh is 1
(one household with a value of 1 for ishh and
any number of people with a value of 0 for ishh),
for the second household it is 2 (two households with 1 each and any number
of individuals with 0), and so forth. You can then drop all the households
(drop if ishh), do some housecleaning, sort,
and you're ready to merge.
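The running sum trick is easier to believe once you see it on a few records. This sketch types in only the type variable for a made-up file of two households (with two and one members respectively) rather than reading hierarchical.txt.

```stata
clear
input str1 type
"h"
"i"
"i"
"h"
"i"
end

gen byte ishh = (type == "h")   // 1 for household records, 0 for individuals
gen hh = sum(ishh)              // running count of household records so far
list                            // inspect before dropping anything

drop if ishh                    // keep only the individuals
list
```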
The merge is simple, just
merge hh using hier1
Now your data set consists of individuals, but each observation includes all
the information about the household they belong to. Here is the complete code
for the second part (hier2.do). Now that
the project is complete, it's easy to write a single do file that will repeat
the entire process. All it needs to contain is
do hier1
do hier2
and proper commenting, of course (hier.do).
Linking Observations Within a Data Set
Our next example will do a lot more with explicit indexing. Make sure you're
somewhat comfortable with using _n and _N,
and brackets []. As a self-test, consider assert
age[_n+1]>=age. This will confirm that the data set has been sorted
in ascending order by age--make sure you understand
why. It works, but in one case it's probably not doing what you expected--can
you see it?
For the last observation age[_n+1] is missing,
and since Stata treats a missing value as larger than any number, the comparison holds. Thus
the assertion is true. But suppose you had sorted the data in descending order
using gsort -age. One might think all we have
to do is change the "greater than" to "less than" to do
the same check. But in that case, the missing value makes the assertion false
even though what you thought you were checking is true.
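One way to make the descending check behave is to skip the last observation, which has no successor to compare against. The sketch below uses three made-up ages; the if _n < _N condition is my suggestion rather than something from the original do files.

```stata
clear
input age
15
30
45
end

sort age
assert age[_n+1] >= age              // passes: the missing successor counts as large

gsort -age
capture assert age[_n+1] <= age      // fails on the last observation
display "return code: " _rc          // nonzero, even though the sort is correct

assert age[_n+1] <= age if _n < _N   // passes: the last observation is skipped
```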
The Example
Suppose you have a data set (link.dta) consisting
of individuals, some of whom are mother and child. Specifically, each individual
has a number (person) and if the person's mother
is in the data set, then mother's number is stored in the variable mother
(otherwise it is missing). Each individual's age is also stored in age.
Your goal will be to find the difference between the mother's age and person's
age. Your first thought may be that we can identify someone's mother with
if person==mother. But Stata works with one
observation at a time, so it will interpret that as requiring that person
be equal to mother for the same observation,
or in English, if the person is her own mother. You won't be surprised to
find that no one fits this criterion.
What you need is a way to map from person to
something Stata can understand, namely observation numbers. Start by creating
such a map. For your map you only need the variable person,
so drop everything else. Next you need a standard order for the observations.
Any order you could easily replicate would do, but sorting by person
is convenient.
keep person
sort person
Then store each person's observation number in this standard order as obsNum.
You'll also need to change the name of person
to id, for reasons that will become clear in
the next part. Finally, save the result. Take a look at the full code (link1.do).
gen obsNum=_n
ren person id
/*
We're going to use id to refer to two different things, so we give it
a generic name.
*/
save map, replace
Now you need to use your map. You will do this by merging it with the original
data, so it will now include observation numbers. But there's a trick: you
will have it merge different observations. Specifically you will have it merge
the child's observation from the original data with the mother's observation
in the map. This means that a child's value of obsNum
will be their mother's observation number, not their own (and yes, you'll
rename it shortly).
Do this by creating a variable called id, to
match the name you gave it in the map, but here set it equal to mother,
not person.
gen id=mother
You then merge by id, meaning Stata will combine
observations that have the same id. But id
means different things in the different files, so we end up combining the
mother's obsNum with the rest of the child's
data.
merge id using map
But what about people whose mothers are not in the data set? In these cases,
there is no matching id and Stata cannot merge
the two observations. So you end up with two copies: one from the original
data that is missing obsNum, and one from the
map that is missing everything but id and obsNum.
You need to delete any observations that come only from the map. Stata makes
this easy: it sets _merge to two for all such
observations, so use this with drop.
drop if _merge==2
Now you're almost done, but remember that your observation numbers are only
meaningful when the data is sorted by person.
And you need to give obsNum a name that is
more meaningful for its current role. Finally, obsNum
(now momsObsNum) needs a good label.
sort person
ren obsNum momsObsNum
label var momsObsNum "Observation number of Mother when sorted by person"
Now you're ready to use it. The point of this entire exercise is that you can
now access information about the mother by putting momsObsNum
in brackets. So to find the age difference, use:
gen diff=age[momsObsNum]-age
Congratulations, you made it! Take a look at the full code with all the housekeeping
though (link2.do). You'll also want another
do file that runs both parts (link.do). If
all you were interested in was the age difference, there are probably easier
ways to do this. But you can easily imagine a data set with a large number
of variables, and you are now ready to work with any number of them.
A limitation of this approach is that the data must be sorted in a particular
way. But if it ever were to become necessary to sort it in some other order,
you could first store all the mother's information in regular variables.
gen momsAge=age[momsObsNum]
The only disadvantage to this approach is that the same information is stored
in two different places, so it is not as memory-efficient.
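If the age difference really is all you need, one possibly simpler route is to merge the data set with a copy of itself in which person has been renamed to mother. The sketch below is an alternative to the observation-number map, not the method in link1.do and link2.do. It assumes Stata 11 or later for the m:1 form of merge, and the five individuals are made up rather than taken from link.dta.

```stata
clear
input person mother age
1 . 50
2 1 25
3 1 22
4 . 40
5 4 10
end
tempfile original moms
save `original'

* Build a lookup table: each person as a potential mother
keep person age
rename person mother
rename age momsAge
save `moms'

* Attach the mother's age to each child; drop lookup rows nobody points to
use `original', clear
merge m:1 mother using `moms', keep(master match) nogenerate

gen diff = momsAge - age
sort person
list
```

This stores the mother's age as a regular variable, so it does not depend on the sort order, at the cost of one merge per set of mother variables you want to bring over.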
A parting thought: what if we had data on both fathers and mothers? We could
use the exact same code, but it seems a waste to write the same commands twice,
with the only difference being the variable name. If you'd like to know a
better way, continue on to Programming in Stata.
|