Stata for Researchers: Working with Groups

This is part five of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

Tasks that require working with groups are common and can range from the very simple ("Calculate the mean mpg of the domestic cars and the foreign cars separately") to the very complex ("Model student performance on a standardized test, taking into account that the students are grouped into classes which are grouped into schools which are grouped into districts."). Fortunately, working with groups is one of Stata's greatest strengths. In this article we'll discuss tools for working with groups, but at the same time try to give you more experience using Stata's syntax to get useful results. In the next section (Hierarchical Data) we'll introduce a theoretical framework for thinking about grouped data, but it will make more sense if you've had some experience working with groups first.

We'll start by going through some basic tools that are used for working with groups, and some tricks for using them. In doing so we'll use one of the example data sets for this series, households.dta. Make sure you copied those files from X:\SSCC Tutorials\StataResearch or downloaded them, and put them in a convenient location like U:\StataResearch. Make sure your current working directory is set to that location:

cd U:\StataResearch

(or wherever you put them). Then load the data:

use households

This data set contains information on twenty fictional people who live in six different households. This data structure is one of the most common encountered at the SSCC. One variable that may require explanation is rel2head, or "relationship to the head of household." It is a categorical variable that takes on three values, with value labels applied. Type label list to see them. This is typical of real-world data (except real data usually have more kinds of relationships).

By

The most important tool for working with groups is by, since it allows you work with each group separately--in this case each household. If you need a review, see the section on by in Usage and Syntax. Here are some examples of things you can do with by.

Calculating Summary Statistics Over Groups

Find the average age of the adults in each household:

by household: sum age if age>=18

(You could get the same results more compactly with tab household if age>=18, sum(age))

Store the total household income in a new variable:

by household: egen householdIncome=total(income)

Note that householdIncome is the same for all the individuals living in a given household. That's because it's a characteristic of the household, not the individual. We'll talk more about this distinction in Hierarchical Data.

Identifying Characteristics of a Group

Create an indicator for whether a household has children or not, regardless of number:

gen child=(age<18)
by household: egen hasChildren=max(child)

If a household has no children, the maximum value of child will be zero. If it has any at all, the maximum will be one.

In this case, child is likely to be a useful variable in its own right. But if you didn't need it, you could do the whole process in one line with:

by household: egen hasChildren=max(age<18)

Now instead of finding the max of a variable, you're finding the max of an expression, but the result is the same: the maximum will be one for the entire household if the household has any children in it and zero otherwise.

Counting Observations that Meet a Condition

Find the number of children in each household:

by household:egen numChildren=total(child)

Here we take advantage of the fact that the total of an indicator variable is the number of observations for which the indicator variable is true.

Result Spreading

Create a variable incomeOfHead containing the head-of-household's income:

gen temp=income if rel2head==1
list
by household: egen incomeOfHead=mean(temp)
drop temp

In this example you first create a temp variable containing the income of just the heads of households (and do a list to see the result). You then find its mean for each household: since there's just one observation with a valid value, the "mean" is just that value. Finally you drop the no-longer-needed temp variable.

This is sometimes called "spreading" a result: if you can find the right answer for one observation in the group, you can then spread it out to the others. You could do the spreading with any of several egen functions: min(), max(), etc., but mean() is perhaps the most intuitive.

Another common scenario that leads to result spreading is using egen to calculate a result for a subsample of the data and then needing to spread that result to the entire data set. For example, let's calculate the average age of the adults in each household again, but this time store the results in a variable:

by household: egen temp=mean(age) if age>=18
list
by household: egen meanAdultAge=mean(temp)
drop temp

The if condition in the first line is needed so that the mean is only calculated over the adults, but it also means that the result is only stored for the adults. The children get missing values. Thus you need to spread the result to the children as well.

Note that just because all these examples involved individuals in households did not mean every command had to start with by household:. by is only needed for commands that work across observations in some way. Many commands, even in the context of groups, only look at one observation at a time. Those do not need to start with by.

Exercises

  1. Create an indicator variable for childless households using the numChildren variable you created earlier. Defend your choice whether or not to use by in the process. (Solution)
  2. Find the age of the youngest adult in each household at the time their first child was born. (Hint: this is a characteristic of the household, not an individual.) (Solution)
  3. Find the mean household income of people in single-parent households and two-parent households. (Solution)

_n and _N

Most Stata commands are actually loops: do something to observation one, then do it to observation two and so forth. As Stata works through this loop, it tracks which observation it is working on in an internal variable called _n. You are welcome to use this variable in your commands:

l if _n==5

will only list observation five, because the condition _n==5 is only true when Stata is working with observation five.

You could do the same thing with in (l in 5). But while in can't be combined with by, _n honors by groups by restarting from one when Stata goes from one by group to the next.

Suppose you wanted to list the first observation in each household:

by household: l if _n==1

It just so happens that the first observation is the head of household in every case, which is not unusual. But what if instead of having rel2head you only knew the head of household by their location in the household?

First, you'd have to be very careful about sorting. Stata's default sort algorithm is not "stable," meaning that if you sort by household it may change the order of observations within the household. If the order of observations matters, you should add the stable option to any sort commands. That way Stata will use a different sort algorithm that is slower but will not change the order of observations within a group.

Creating Within-Group Identifiers

But it's even better to make a within-group identifier so you can always tell which observation is which, even after a mistaken sort. In this case the within-group identifier could logically be called person:

by household: gen person=_n

If Conditions with Observation Numbers

Second, if the head of household is always the first observation, _n means something, at least when it's one. Thus you could replace:

gen temp=income if rel2head==1

(since we're assuming rel2head no longer exists) with:

by household: gen temp=income if _n==1

The by household: had to be added not because the command looks across observations, but so that _n is reset to one when Stata reaches a new household.

Another internal variable, _N, contains the number of observations in the data set. It is also the observation number of the last observation. You can use it in commands just like _n:

l if _n==_N

will list the last observation.

Finding the Size of a Group

Like _n, _N honors by groups. Thus _N contains the number of observations in the by group currently being worked on. You can easily find household size with:

by household: gen size=_N

Subscripts

Consider the command:

gen newIncome=income

In carrying it out, Stata looks at one observation at a time, and sets newIncome for that observation equal to income for the same observation. Subscripts allow you to look at the value of a variable for any observation you want. Try:

gen newIncome2=income[1]

income[1] means "the value of income for observation 1." Thus newIncome2 will be 60,000 for all observations (not that that is a useful result).

Spreading Characteristics of a Special Observation

Consider trying to identify the female-headed households:

by household: gen femaleHead=female[1]

Since the first person in each household is the head, the household has a female head if and only if the first person is female.

Calculations Based on an Observation's Neighbors

Subscripts can contain mathematical expressions, including _n and _N. Consider a data set containing enrollment numbers for ten fictional schools:

use schools, replace

We'll define a student's peer group as everyone in her grade, the grade above her, and the grade below her. To find the size of each grade's peer group, type the following:

by school: gen peerGroup=students+students[_n+1]+students[_n-1]

The result is missing for grade one because it doesn't have a grade before it, and for grade twelve because it doesn't have a grade after it. Thus students[_n-1]or students[_n+1] give missing values for them.

Exercises

  1. Redefine peerGroup to include two grades above and below a student's grade. Furthermore, where that calculation would include grades that don't exist, find peerGroup by adding up the grades that do. (Solution)
  2. Going back to the original command:

    by school: gen peerGroup=students+students[_n+1]+students[_n-1]

    What would happen if you left out by school:? What would happen if some schools didn't have an observation for some grades? (Solution)

Panel Data

Panel data, or data with multiple individuals observed multiple times, can be treated like grouped data even though a "group" in this case is an individual. (This is why we introduce more general terminology in Hierarchical Data.) Consider a data set of five people observed for twenty months, with each person's employment status recorded each month:

use employment, replace

Identifying Spells

A typical person in this panel is employed for a while, then unemployed for a while, etc. Each period of continuous employment or unemployment is called a "spell" and a common first task with such data is to identify the spells.

Begin by identifying the months which start a new spell, i.e. the months where a person's employment status is different from what it was the previous month:

by person: gen start=(employed!=employed[_n-1])

For the first month in which a person is observed, the quantity employed[_n-1] does not exist and is thus missing. Since employed is never missing (how would you check that?) this guarantees that the first month a person is observed is marked as the start of a new spell.

Next comes something you should add to your bag of tricks:

by person: gen spell=sum(start)

The sum() function finds running sums, i.e. the sum of a variable for all observations up to and including the current observation. Since start is one whenever a spell starts and zero otherwise, sum(start) for an observation is the number of spells which have started up to that point--and that serves as a splendid spell ID.

Once you've identified the spells, you can treat them as groups. However, these spell IDs only make sense within the context of a person (each person has their own spell number one). Thus the proper by is by person spell:, and the first time you use it you'll have to say bysort. But everything you've learned still applies. For example, finding the duration of a spell is exactly like finding the size of a household:

bysort person spell: gen duration=_N

Exercises

  1. Think back to the command:

    by person: gen start=(employed!=employed[_n-1])

    What would happen if you omitted the by? (Solution)

  2. Create variables containing the start month, start year, end month and end year for each spell. (Solution)
  3. Find the mean spell length for each person. Make sure the mean is calculated over spells, not months. (Solution)

Next: Hierarchical Data

Previous: Statistics

Last Revised: 2/8/2010