SSCC Publications

Intermediate Stata

Last Revised: 8/25/2004

Once you're familiar with the basics of how Stata works (see An Introduction to Stata if you're not), this publication will allow you to move beyond the obvious. Stata's syntax is logical and structured, but the tricks you can play with it may surprise you. With a bit of experience, you'll very easily have Stata doing things other statistical programs would find extremely difficult. In this publication we'll spend a lot of our time working by example, meaning we'll make up a problem and then find a clever way to solve it. You won't have much trouble finding ways to apply the same principles in your work.

We'll use a fair number of very small files, all of which are available for download. I suggest you get them all before beginning so you won't be interrupted as you work, though there are links to them throughout this document. I have created a separate page listing all the files where you can download them. If you are using an SSCC Linux server, the following Linux commands will create a new directory called intstata and copy all the files into it:

mkdir intstata
cd intstata
cp /usr/global/web/sscc/pubs/files/4-10/* .

General Stata Syntax

Let's start with some general ideas to make sure we have a common vocabulary. The syntax for a Stata command is:

[by varlist:] command [varlist] [=expression] [if expression] [in range] [weights] [, options]

Brackets denote optional elements, though some commands require some elements or cannot use others. Note that a varlist can have any number of variables, though some commands have specific requirements. Stata is very consistent in its use of this syntax, and it can be helpful to understand where all the various parts of the commands you give fit into this scheme.

The second general idea to keep in mind is that Stata executes each command one observation at a time. Essentially each command includes an implicit loop that says "execute this command for each observation."

Good Practices

Next we'll go through some general practices that will not only make you a more efficient programmer, but will help you avoid embarrassing errors.

Do Files

The first rule is to always write do files. Interactive mode is good for learning, for error checking, and for exploring your data, but anything you will take seriously needs to be reproducible. That means putting the commands into a do file. If everything is written in a do file, you know exactly what you did and can repeat it on demand. In one case I saw recently, six months into a major project the researchers noticed that some frequencies were extremely implausible--clearly the result of an error. But they had no idea when the error had occurred. Had they done all their work in do files, they could have repeated the analysis, one step at a time, checking the frequencies after each one until they found the mistake. Then they could fix the error and rerun all their do files without having to duplicate any of their previous work (the computer would have been duplicating previous work, but within reason we don't care how much work it does).

A do file is an ASCII text file whose name ends with .do and contains Stata commands. Since do files are ASCII text you can use any text editor you like. I recommend emacs for Linux or TextPad for Windows (you can associate the do extension with TextPad in Windows so do files open automatically). You can use TextPad even if you are using Linux to run Stata--take a look at Running Linux Programs Using Windows (Mostly).

In Linux, you will normally run your do file in batch mode. Just type

> stata -b do dofile &

where dofile is the name of the do file you want to run. Stata will run quietly in the background and exit when it is done. For big jobs, simply replace stata with condor_stata and your job will be submitted to our Condor pool for faster processing. Note that the current working directory for the Stata job will be the current working directory of your shell when you started it.

It is possible to do the same thing in Windows using the command prompt, but this is awkward. Normally you will start a Stata session and then run your do file from within it (this works in Linux Stata too). Just type

do dofile

in the Stata command window. The difference is that your do file will be affected by the current state of your Stata session, and may change that state as it runs. For example, there may already be data in memory, or your do file may leave its log file open. So if you will be running do files this way, you probably want to start them all with:

clear
capture log close
set more off

The clear command removes any data currently in memory so your program starts with a clean slate. log close closes the current log, but this will generate an error (and crash your do file) if there is no log open. Thus we precede it with capture. The capture command tells Stata to ignore any errors generated by the next command--we'll do more with it in Programming in Stata. With capture, we're essentially saying "close the log if it's open, but if it's not don't worry about it." set more off tells Stata not to wait for you to hit a key when the screen fills up.

Logs

If you are running in batch mode, the only way to view your output is to save it in a log and then read the log file after the do file has run. But even if you are running your do file from inside a session, you'll want to save the output for the future. So usually the first thing your do file will do (after any setting up as described above) is to open a log. The command is just

log using logfile.log, replace

where logfile should be the name of the file where you want to save the log. Normally this should be very similar to the name of the do file, so you know which files go together. The replace option tells Stata to go ahead and overwrite a previous log with that name. The default is for Stata to refuse to overwrite existing logs. But this means if you run a do file, find an error, and want to run it again, you must manually delete the old log first.
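Putting the pieces so far together, a do file might be laid out like the following sketch. The names example.do and example.log, and the use of the auto data set that ships with Stata, are placeholders chosen for illustration; they are not part of the files for this publication.

```stata
/* example.do: a skeleton do file (the name is a placeholder) */

/* Start from a known state */
clear
capture log close
set more off

/* Save all output; the log name matches the do file name */
log using example.log, replace

/* The real work goes here. As a stand-in, load a data set that
   ships with Stata and summarize two of its variables. */
sysuse auto
summarize price mpg

/* Close the log so the next do file does not find it open */
log close
```

Because the log is opened with replace, you can rerun this file as often as you like without deleting example.log by hand.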

Comments

My current position has made me a big fan of comments. Anything between /* and */ will be ignored by Stata. This allows you to insert explanations of what you are doing and why for humans to read. If anyone else ever needs to read and understand your code, they will be eternally grateful for comments. But the most likely beneficiary is yourself. In six months (let alone ten years) your code might as well have been written by someone else. Commenting is an investment with a very high rate of return. The examples all include comments.

Variable Names

I am amazed and dismayed by some of the gibberish that is used for variable names. There's simply no need to live with variables like H2V06 or worse. Renaming variables to something that is meaningful takes a bit of time, but will save a great deal of time and confusion down the road. Keep in mind that Stata can now use variable names that are up to 32 characters long, and that any variable name can be abbreviated. The command syntax is just ren oldName newName.

Variable names must be one word with no spaces. However capitalization can make it more readable. For comparison try to interpret numinhh vs. numInHH (number in household). Another alternative is to use the underscore (_) as a space: num_in_hh. Personally I get tired of reaching for the underscore key and holding down SHIFT, but it's a matter of taste.

The same applies to the length of variable names. numberInHousehold is very clear, but it's fairly long. numInHH is much shorter, but you pretty much have to know what it is to read it. In general the more often you use a variable the shorter the name you want to give it, both to save typing and because the meaning will be familiar anyway. In addition, Stata will allow you to use any unique abbreviation of your variable names (though this may make your code harder to read). So using numberInHousehold as a variable name may not be so bad if it's the only variable that starts with num, because you can just type num in commands and Stata will know you're referring to numberInHousehold.
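Here is a small sketch of renaming and abbreviation in action. The two observations are made up purely so there is something to rename, and H2V06 is just the gibberish name mentioned above.

```stata
clear

/* Two made-up observations with an uninformative variable name */
input H2V06
3
5
end

/* Give the variable a meaningful name */
rename H2V06 numberInHousehold

/* num is a unique abbreviation, so Stata reads it as numberInHousehold */
summarize num
```

If you later add another variable starting with num, the abbreviation stops being unique and Stata will complain, which is one reason to spell names out in do files you plan to keep.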

Length of do files

Suppose you have a major project that will take several hours of computer time to complete. The last thing you want is to discover that you mistyped the very last command and have to start all over again. It is far better to break up the project into smaller steps and write a separate do file for each. Not only will this avoid wasting time rerunning good code to fix the bad code that comes after it, but it also makes it easier to test your results after each step, avoiding subtle errors. And because do files can call other do files, it is a simple matter to write a single master do file that runs all the steps you have successfully completed.

Recycling Code

In programming circles there's a huge emphasis on writing reusable code. The best tool for doing it--object oriented programming--is not available in Stata. But you can do what's called modular programming. Just take a moment when you start a project and see if there are any steps that might also apply to future work. If so, isolate them in separate do files and then as much as possible write the code in such a way that it doesn't depend on the particular features of the data set you are working with now. With luck, you'll be able to use that do file in the future and save yourself a lot of time.

Now on to the examples...

Using Reshape

I wrote this section because once for about a month it seemed like the answer to every question I got was reshape. reshape is useful for data sets where the observations fall naturally into groups and subgroups. We'll do two examples. In the first a "group" is a country and a subgroup is that country's data for a particular year. In the second the group is a household and the subgroup is an individual in that household. In the long form, each subgroup is an observation; in the wide form, each group is an observation and each subgroup has its own variable(s). Consider the following miniature data sets (reshape1.dta is the first one) from the Penn World Tables. They both contain the exact same information, the first in long form, the second in wide:

 

country year pop
ALGERIA 1980 18669
ALGERIA 1981 19254
ALGERIA 1982 19862
ALGERIA 1983 20495
ALGERIA 1984 21173
ALGERIA 1985 21848
ALGERIA 1986 22497
ALGERIA 1987 23124
ALGERIA 1988 23758
ALGERIA 1989 24374
ALGERIA 1990 25003
ANGOLA 1980 7581
ANGOLA 1981 7783
ANGOLA 1982 7990
ANGOLA 1983 8202
ANGOLA 1984 8400
ANGOLA 1985 8605
ANGOLA 1986 8841
ANGOLA 1987 9084
ANGOLA 1988 9334
ANGOLA 1989 9590
ANGOLA 1990 .

 

country pop1980 pop1981 pop1982 pop1983 pop1984 pop1985 pop1986 pop1987 pop1988 pop1989 pop1990
ALGERIA 18669 19254 19862 20495 21173 21848 22497 23124 23758 24374 25003
ANGOLA 7581 7783 7990 8202 8400 8605 8841 9084 9334 9590 .

 

To go from the first format to the second, just type:

reshape wide pop, i(country) j(year)

To go from the second to the first, type:

reshape long pop, i(country) j(year)

i(variable) identifies what constitutes a group, in this case, a country. j(variable) tells us what identifies a subgroup, in this case a year for a given country. Note that year does not appear as a variable in the wide form; instead its values have been appended to the variable name pop. If you had more variables that apply to the subgroups, they would be added to the varlist right after pop.
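As a sketch of the multi-variable case, the following builds a four-observation data set and reshapes it both ways. The pop values are taken from the table above; gdp is an invented placeholder variable with made-up values, included only to show a second subgroup variable in the varlist.

```stata
clear

/* pop comes from the table above; gdp is an invented placeholder */
input str7 country int year long pop float gdp
"ALGERIA" 1980 18669 1.0
"ALGERIA" 1981 19254 1.1
"ANGOLA"  1980  7581 2.0
"ANGOLA"  1981  7783 2.1
end

/* Both pop and gdp vary by year, so both go in the varlist */
reshape wide pop gdp, i(country) j(year)
list

/* Stata remembers the last reshape, so this returns to long form */
reshape long
list
```

After the first reshape you should see one observation per country with pop1980, gdp1980, pop1981 and gdp1981 as variables.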

Using Reshape to Separate Individuals from Household Records

Another common application for reshape is to create separate records for individuals from a data set where each record is a household. For example consider the following:

hh_id income age1 sex1 age2 sex2 age3 sex3
1 30000 30 F 2 F    
2 90000 45 M 43 F 15 M

 

In this data set (reshape2.dta) an observation is one household, but it contains information on the individuals in that household. If we want to do analysis based on the individuals, we'll need to separate them out. Simply type:

reshape long age sex, i(hh_id) j(orderInHH)

And the result will be:

hh_id orderInHH income age sex
1 1 30000 30 F
1 2 30000 2 F
1 3 30000    
2 1 90000 45 M
2 2 90000 43 F
2 3 90000 15 M

 

Note that orderInHH was generated because we told Stata that's what the numbers at the end of the individual-level variables meant. We could have omitted j(variable) from the command entirely, in which case Stata would have made a variable called _j. The orderInHH variable may not contain useful information, in which case it can be dropped immediately, but Stata will insist on creating a variable for j when you use reshape.

Notice that the variables in the reshape varlist are interpreted as subgroup (in this case, individual) variables, and anything that is not listed is assumed to be a group (in this case, household) variable.

The problem is person 3 in household 1. She doesn't really exist (household 1 only has two members), but Stata just thinks she has missing values for age and sex. What's worse, a real data set needs as many sets of individual variables as there are members in the largest household in the survey, so there are usually a lot of people who don't exist. The difficulty is telling who doesn't exist from who is just missing some variables. My rule of thumb is that if a person is missing all the individual variables, they don't exist. The following code will identify and drop them in this data set. If you have a lot more variables in your data set this command will be a lot longer.

drop if sex=="" & age==.

Note that the code for a missing string is "" rather than .. Now we have a data set of individuals, but each one has all the household level variables as well.
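If you would rather not remember which variables are strings, the missing() function is one alternative: it is true for . in a numeric variable and for "" in a string variable. The sketch below retypes the two households from the table above so it can be run on its own; it is not the code in reshape.do.

```stata
clear

/* The two households from the table above, typed in by hand */
input byte hh_id long income byte age1 str1 sex1 byte age2 str1 sex2 byte age3 str1 sex3
1 30000 30 "F"  2 "F"  . ""
2 90000 45 "M" 43 "F" 15 "M"
end

reshape long age sex, i(hh_id) j(orderInHH)

/* Drop people who are missing every individual-level variable */
drop if missing(age) & missing(sex)

list, sepby(hh_id)
```

Note that the conditions are joined with &, not |. A person missing only one of the variables is kept, in line with the rule of thumb above.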

All these commands can be found in reshape.do.

Working with Groups

Data with individuals and households is very common and Stata handles it quite well (if you want a programming challenge, try doing this section in SAS). The key is the by: group. by: groups are processed separately for each unique combination of the by: variables, and this includes having their own unique values of _n (observation number) and _N (number of observations). This allows for some nifty programming tricks.

Continuing to use the households you worked with in the last section, start by finding the size of each household:

bysort hh_id: gen size=_N

by: requires that the data be sorted by the by: variables, but bysort takes care of this automatically, saving you a step. You then set size equal to the number of observations in each household (_N).

Next find the number of adults in each household. Start by creating an indicator variable for adult status:

gen adult=(age>17)

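Note how you have set adult equal to a condition. Stata has no Boolean type, so true/false conditions are stored as 1 for true and 0 for false. This makes it very easy to create indicator variables. Note that you can use it the other way too: if adult will be false if adult is zero and true if it has any other value. One caution: Stata treats a missing value as larger than any number, so a person with a missing age would satisfy age>17 and be counted as an adult. No one left in this data set has a missing age, but in real data you will usually want to handle that case explicitly. The next step is to simply add up the number of adults in each household.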

by hh_id: egen adults=sum(adult)

Having one variable called adult and another called adults is probably a bad idea, but you only created adult as a step towards calculating adults, so let's drop it:

drop adult

The following code will create an indicator variable for whether the household contains a member that is male:

gen male=(sex=="M")
by hh_id: egen hasMale=max(male)
drop male

The part you haven't seen before is the second line. The max function does exactly what you think it does: return the maximum value of male. But since you're working by hh_id: it returns the maximum value of male in that household. If there is no male in the household, then male is always zero and max(male) is zero. If there is a male, then male is one for at least one observation, and max(male) is one.
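The steps in this section can be run on their own with the sketch below, which retypes the five real individuals rather than loading the file. It differs from the commands in the text in two ways, both of which are my assumptions rather than what hh.do does: adult is left missing when age is missing, and the count uses egen's total(), which as far as I know is the name newer versions of Stata give to the sum() function used above. If your version does not recognize total(), use sum() instead.

```stata
clear

/* The five real individuals from the household example */
input byte hh_id byte orderInHH byte age str1 sex
1 1 30 "F"
1 2  2 "F"
2 1 45 "M"
2 2 43 "F"
2 3 15 "M"
end

/* Household size: _N is the number of observations in each by: group */
bysort hh_id: gen size = _N

/* Leave adult missing when age is unknown instead of counting
   the person as an adult */
gen adult = (age > 17) if !missing(age)
by hh_id: egen adults = total(adult)
drop adult

/* 1 if any member of the household is male, 0 otherwise */
gen male = (sex == "M")
by hh_id: egen hasMale = max(male)
drop male

list, sepby(hh_id)
```

Household 1 should come out with size 2, one adult and hasMale equal to 0; household 2 with size 3, two adults and hasMale equal to 1.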

Two more little tricks, mostly here as puzzles. The first is a way to check if what we think uniquely identifies an observation actually does. In this case the only id variable as such is hh_id. But a combination of hh_id and orderInHH should uniquely identify an individual. Let's see:

bysort hh_id orderInHH: assert _N==1

Recall that assert checks that a condition is true and gives you an error if it is not. In this case you've broken your data set into a separate group for each unique combination of hh_id and orderInHH, so if there's more than one observation in the group, they do not uniquely identify an observation.

Here's a quick check to see if there are any duplicate observations in your data:

bysort *: assert _N==1

The * is shorthand for all the variables in the data set. So in this case you're making a group for each unique combination of all the variables. If there's more than one observation in one of these groups, they are duplicates. Of course this is redundant now: since you know that hh_id and orderInHH uniquely identify an observation, there can't be any duplicates.
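Stata also has commands built for these two checks, isid and duplicates, and it can be worth knowing them alongside the by: versions. The sketch below uses the auto data set that ships with Stata instead of the household data, so the variable make stands in for the identifier.

```stata
sysuse auto, clear

/* The by: approach from the text: make should identify each car */
bysort make: assert _N==1

/* isid does the same check and reports an error if it fails */
isid make

/* duplicates report counts observations identical on all variables */
duplicates report
```

The by: versions are still worth understanding, since the same pattern works for conditions that have no ready-made command.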

The code for this section is in hh.do.

Combining Data Sets

Combining data sets is very easy in Stata, though the logic behind it can (like reshape) occasionally be headache-inducing. Stata always works with one data set at a time, so you will always be combining the data set in memory (the master data set) with another data set on disk (the using data set, for reasons that will be clear momentarily).

Appending Data Sets

When appending, data sets are stacked.

Stata calls it appending two data sets when you want to add the observations from the using data set to the master. For example, if you had data on domestic cars in one data set and foreign cars in another, you'd use append to combine them. If the variables are not identical, then the resulting data set will have all the variables used in either data set (the union). Missing values will be assigned as needed. For example if your master data set only has a variable called X and you append a data set that only has a variable called Y, the resulting data set will have both X and Y. Observations from the master data set will have missing values for Y, while observations from the using data set will have missing values for X.

The syntax is simple: load the master data set into memory and then type

append using dataset

where dataset is the name of the data set you want to append.
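The X and Y case described above can be tried with the following sketch. The values are made up and appendY is a placeholder file name; note that the code saves appendY.dta in your current directory.

```stata
/* Using data set: only Y (the file name is a placeholder) */
clear
input Y
10
20
end
save appendY, replace

/* Master data set: only X */
clear
input X
1
2
end

/* Stack the using data set underneath the master */
append using appendY
list
```

The result has four observations: the first two have values for X and missing Y, and the last two the reverse.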

Merging Data Sets

When merging, data sets are placed side by side.

In a merge, observations are combined. For example, if you had data on car make and price in one file, and car mileage and repair record in another, you'd want to use merge to combine them. The key to merge is that the data will be combined based on one or more identifying variables you specify. If the variables uniquely identify an observation in both data sets, then you are performing a one-to-one merge. If the variables uniquely identify an observation in one data set but not the other, then you are performing a many-to-one merge (one example would be combining data about individuals with data about the state they live in). If the variables do not identify a unique observation in either data set, then you are probably making a mistake.

The syntax is similar to append, except that you add a varlist with the identifying variables right after the word merge:

merge varlist using dataset

where varlist is the list of identifying variables and dataset is the using dataset (assuming you have already loaded the master data set). Note that both datasets must be sorted by the varlist before you perform the merge. Often this involves loading the using data set, sorting it, saving it, and then loading the master data set and sorting it before you're finally ready to do the merge itself.

Whenever you perform a merge, Stata will create a new variable called _merge. _merge=1 means the observation came just from the master data set. _merge=2 means it came just from the using data set. _merge=3 means it combined information from both (often the ideal). You should normally take a look at the values of _merge (with tab _merge for example), but you'll need to drop it before you can perform any further merges.
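Here is a minimal sketch of a one-to-one merge using the syntax above. The ids and values are made up, and mileage is a placeholder file name that will be saved in your current directory. One assumption to flag: in Stata 11 and later the preferred form is merge 1:1 id using mileage, which also does not require sorting; the older form shown here is still accepted there as far as I know, with a note that you are using old syntax.

```stata
/* Using data set: mileage for cars 1, 2 and 3 (made-up values) */
clear
input id mpg
1 22
2 17
3 30
end
sort id
save mileage, replace

/* Master data set: price for cars 1, 2 and 4 (made-up values) */
clear
input id price
1 4099
2 4749
4 3799
end
sort id

/* Combine observations with the same id */
merge id using mileage

/* Cars 1 and 2 matched (3), car 4 is master only (1),
   car 3 is using only (2) */
tab _merge
list

/* Drop _merge so a later merge can create it again */
drop _merge
```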

Basic appends and merges are simple enough to do; the trick is often getting the data into a form suitable to be merged. And it's sometimes surprising what a merge can be used for, as you'll see in the following examples.

Reading Hierarchical Data

Data files come in all shapes and sizes, but here is one that's tricky enough to make a good example and you may well find something similar someday. Suppose you have a data set containing information on both household and individuals in the following format:

Information on Household 1

Information on Individual 1 of Household 1

Information on Individual 2 of Household 1...

Information on Household 2

Information on Individual 1 of Household 2

Information on Individual 2 of Household 2...

This is often called hierarchical data. The tricky part is that the only thing that tells us which household an individual belongs to is their location in the file; the records do not contain a household ID (yet). hierarchical.txt is an example data set in this format.

The first step is to write a data dictionary, or rather two data dictionaries, one for each record type. I've provided these for you, hh.dct and ind.dct. The important thing is that each record has a type variable that tells whether it is a household or a person (h for household, i for person). This problem would be even more difficult without that; fortunately it is fairly common.

You could read these data by using the data dictionaries with an if condition to ensure we only read records that match the data dictionary we are using:

infile using hh if type=="h"

infile using ind if type=="i"

If the records had a household ID, then it would just be a matter of merging the two resulting data sets by household ID. Since they don't, you'll have to work a little harder. The first step will be to read in only the household data and generate a household number (we'll call it hh). The household number will be just the observation number (_n), once the data set is just households.

gen byte hh=_n

You'll also need to sort it by hh for future merging and do some housecleaning. Here is the complete code for the first step (hier1.do).

Now you have a data set of all the households, with each given a household number. The trick is to give the individuals the same household numbers. Start by reading in the entire file, households and individuals, using the individual dictionary. This means for the households the only valid variable will be type, but that's all you need.

Next, generate an indicator variable saying whether an observation is a household or not (ishh). It is 1 for households and zero for individuals.

gen byte ishh= (type=="h")

Now you will do a trick. The sum function used with gen (which has nothing whatsoever to do with the summarize command) will make a running sum of a variable. And it turns out that

gen hh=sum(ishh)

gives you exactly the household number you need. For each observation, hh will be the sum of ishh for that observation and all previous observations. For anyone in the first household, hh is 1 (one household with a value of 1 for ishh and any number of people with a value of 0 for ishh), for the second household it is 2 (two households with 1 each and any number of individuals with 0), and so forth. You can then drop all the households (drop if ishh), do some housecleaning, sort, and you're ready to merge.
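To see the running sum at work without the data dictionaries, here is a sketch that types in only the type variable for two made-up households, the first with two members and the second with one. It is not the code in hier2.do.

```stata
clear

/* Record types in file order: a household, then its members */
input str1 type
"h"
"i"
"i"
"h"
"i"
end

/* 1 for household records, 0 for individuals */
gen byte ishh = (type=="h")

/* Running sum: goes up by one at each household record */
gen hh = sum(ishh)
list

/* Keep only the individuals, each now tagged with a household number */
drop if ishh
list
```

In the first listing hh is 1 for the first three records and 2 for the last two, so each individual carries the number of the household record that precedes it.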

The merge is simple, just

merge hh using hier1

Now your data set consists of individuals, but each observation includes all the information about the household they belong to. Here is the complete code for the second part (hier2.do). Now that the project is complete, it's easy to write a single do file that will repeat the entire process. All it needs to contain is

do hier1
do hier2

and proper commenting, of course (hier.do).

Linking Observations Within a Data Set

Our next example will do a lot more with explicit indexing. Make sure you're somewhat comfortable with using _n and _N, and brackets []. As a self-test, consider assert age[_n+1]>=age. This will confirm that the data set has been sorted in ascending order by age--make sure you understand why. It works, but in one case it's probably not doing what you expected--can you see it?

For the last observation age[_n+1] is missing, and since missing is coded as +infinity, it is greater than anything. Thus the assertion is true. But suppose you had sorted the data in descending order using gsort -age. One might think all we have to do is change the "greater than" to "less than" to do the same check. But in that case, the missing value makes the assertion false even though what you thought you were checking is true.
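One way to make the descending check behave is to leave the last observation out of the assertion, as in the sketch below. It uses price from the auto data set that ships with Stata rather than age, simply so it can be run as is.

```stata
sysuse auto, clear

/* Ascending: the last observation compares against missing and passes */
sort price
assert price[_n+1] >= price

/* Descending: skip the last observation, which has no successor */
gsort -price
assert price[_n+1] <= price if _n < _N
```

Without the if _n < _N qualifier the second assert fails on the last observation, even though the data really are in descending order.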

The Example

Suppose you have a data set (link.dta) consisting of individuals, some of whom are mother and child. Specifically, each individual has a number (person) and if the person's mother is in the data set, then the mother's number is stored in the variable mother (otherwise it is missing). Each individual's age is also stored in age. Your goal will be to find the difference between the mother's age and the person's age. Your first thought may be that you can identify someone's mother with if person==mother. But Stata works with one observation at a time, so it will interpret that as requiring that person be equal to mother for the same observation, or in English, if the person is her own mother. You won't be surprised to find that no one fits this criterion.

What you need is a way to map from person to something Stata can understand, namely observation numbers. Start by creating such a map. For your map you only need the variable person, so drop everything else. Next you need a standard order for the observations. Any order you could easily replicate would do, but sorting by person is convenient.

keep person
sort person

Then store each person's observation number in this standard order as obsNum. You'll also need to change the name of person to id, for reasons that will become clear in the next part. Finally, save the result. Take a look at the full code (link1.do).

gen obsNum=_n
ren person id
/*
We're going to use id to refer to two different things, so we give it
a generic name.
*/
save map,replace

Now you need to use your map. You will do this by merging it with the original data, so it will now include observation numbers. But there's a trick: you will have it merge different observations. Specifically you will have it merge the child's observation from the original data with the mother's observation in the map. This means that a child's value of obsNum will be their mother's observation number, not their own (and yes, you'll rename it shortly).

Do this by creating a variable called id, to match the name you gave it in the map, but here set it equal to mother, not person.

gen id=mother

You then merge by id, meaning Stata will combine observations that have the same id. But id means different things in the different files, so we end up combining the mother's obsNum with the rest of the child's data.

merge id using map

But what about people whose mothers are not in the data set? In these cases, there is no matching id and Stata cannot merge the two observations. So you end up with two copies: one from the original data that is missing obsNum, and one from the map that is missing everything but id and obsNum. You need to delete any observations that come only from the map. Stata makes this easy: it sets _merge to two for all such observations, so use this with drop.

drop if _merge==2

Now you're almost done, but remember that your observation numbers are only meaningful when the data is sorted by person. And you need to give obsNum a name that is more meaningful for its current role. Finally, obsNum (now momsObsNum) needs a good label.

sort person
ren obsNum momsObsNum
label var momsObsNum "Observation number of Mother when sorted by person"

Now you're ready to use it. The point of this entire exercise is that you can now access information about the mother by putting momsObsNum in brackets. So to find the age difference, use:

gen diff=age[momsObsNum]-age

Congratulations, you made it! Take a look at the full code with all the housekeeping though (link2.do). You'll also want another do file that runs both parts (link.do). If all you were interested in was the age difference, there are probably easier ways to do this. But you can easily imagine a data set with a large number of variables, and you are now ready to work with any number of them.
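For a compact view of the whole technique, here is a sketch that runs both parts on five made-up people. It is not the code in link1.do or link2.do: the data are invented, and linkPeople and linkMap are placeholder file names (both are saved in your current directory). It uses the merge syntax shown in this publication; in Stata 11 and later the equivalent would be merge m:1 id using linkMap.

```stata
/* Made-up data: persons 2 and 3 are children of 1, person 5 of 4 */
clear
input person mother age
1 . 50
2 1 25
3 1 22
4 . 40
5 4 10
end
save linkPeople, replace

/* Part one: map each person to an observation number */
keep person
sort person
gen obsNum = _n
rename person id
save linkMap, replace

/* Part two: attach the mother's observation number to each child */
use linkPeople, clear
gen id = mother
sort id
merge id using linkMap

/* Drop map-only observations (people who are nobody's mother here) */
drop if _merge==2
drop _merge id

/* The observation numbers only make sense in this order */
sort person
rename obsNum momsObsNum

/* Missing momsObsNum gives a missing result, as it should */
gen diff = age[momsObsNum] - age
list
```

Persons 2, 3 and 5 should end up with diff equal to 25, 28 and 30, and persons 1 and 4 with a missing diff because their mothers are not in the data.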

A limitation of this approach is that the data must be sorted in a particular way. But if it ever were to become necessary to sort it in some other order, you could first store all the mother's information in regular variables.

gen momsAge=age[momsObsNum]

The only disadvantage to this approach is that the same information is stored in two different places, so it is not as memory-efficient.

A parting thought: what if we had data on both fathers and mothers? We could use the exact same code, but it seems a waste to write the same commands twice, with the only difference being the variable name. If you'd like to know a better way, continue on to Programming in Stata.
