*This is part five of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.*

Tasks that require working with groups are common and can range from the very simple ("Calculate the mean mpg of the domestic cars and the foreign cars separately") to the very complex ("Model student performance on a standardized test, taking into account that the students are grouped into classes which are grouped into schools which are grouped into districts."). Fortunately, working with groups is one of Stata's greatest strengths. In this article we'll discuss tools for working with groups, but at the same time try to give you more experience using Stata's syntax to get useful results. In the next section (Hierarchical Data) we'll introduce a theoretical framework for thinking about grouped data, but it will make more sense if you've had some experience working with groups first.

We'll start by going through some basic tools that are used for working with groups, and some tricks for using them. In doing so we'll use one of the example data sets for this series, households.dta. Make sure you copied those files from X:\SSCC Tutorials\StataResearch or downloaded them, and put them in a convenient location like U:\StataResearch. Make sure your current working directory is set to that location:

cd U:\StataResearch

(or wherever you put them). Then load the data:

use households

This data set contains information on twenty fictional people who live in six different households. This data structure is one of the most common encountered at the SSCC. One variable that may require explanation is rel2head, or "relationship to the head of household." It is a categorical variable that takes on three values, with value labels applied. Type label list to see them. This is typical of real-world data (except real data usually have more kinds of relationships).

The most important tool for working with groups is *by*, since it allows you work with each group separately--in this case each household. If you need a review, see the section on *by* in Usage and Syntax. Here are some examples of things you can do with *by*.

Find the average age of the adults in each household:

by household: sum age if age>=18

(You could get the same results more compactly with tab household if age>=18, sum(age))

Store the total household income in a new variable:

by household: egen householdIncome=total(income)

Note that householdIncome is the same for all the individuals living in a given household. That's because it's a characteristic of the household, not the individual. We'll talk more about this distinction in Hierarchical Data.

Create an indicator for whether a household has children or not, regardless of number:

gen child=(age<18)

by household: egen hasChildren=max(child)

If a household has no children, the maximum value of child will be zero. If it has any at all, the maximum will be one.

In this case, child is likely to be a useful variable in its own right. But if you didn't need it, you could do the whole process in one line with:

by household: egen hasChildren=max(age<18)

Now instead of finding the max of a variable, you're finding the max of an expression, but the result is the same: the maximum will be one for the entire household if the household has any children in it and zero otherwise.

Find the number of children in each household:

by household:egen numChildren=total(child)

Here we take advantage of the fact that the total of an indicator variable is the number of observations for which the indicator variable is true.

Suppose we need to store the mean age of the adults in each household as a variable. The obvious starting point would be:

by household: egen meanAdultAge=mean(age) if age>=18

However, meanAdultAge receives a missing for all the children in the data set. That's because the *if* condition does two things in this command: it controls which which observations are used in calculating the mean to be stored in meanAdultAge, but also which observations that mean is stored in. If we need the household's meanAdultAge to be available in all the observations for that household (and we usually do), then we need to "spread" the result to the other observations.

by household: egen temp=mean(meanAdultAge)

drop meanAdultAge

rename temp meanAdultAge

All the observations in each household that have a value for meanAdultAge have the same value. Thus the mean() function returns that value—but it does so for all the observations in the household. (Recall that when mean() encounters missing values it essentially ignores them and calculates the mean of the non-missing values.) Thus the temp variable contains the proper value of meanAdultAge for all observations, adults and children. We then drop the old meanAdultAge variable and rename temp meanAdultAge. If we plan ahead we can save one line of code compared to the above:

by household: egen temp=mean(age) if age>=18

by household: egen meanAdultAge=mean(temp)

drop temp

This is sometimes called "spreading" a result: if you can find the right answer for some of the observation in a group, you can then spread it out to the others. You could do spreading with any of several egen functions: min(), max(), etc., but mean() is perhaps the most intuitive.

- Create an indicator variable for childless households using the numChildren variable you created earlier. Defend your choice whether or not to use
*by*in the process. (Solution) - Find the age of the youngest adult in each household at the time their first child was born. (Hint: this is a characteristic of the household, not an individual.) (Solution)
- Find the mean household income of people in single-parent households and two-parent households. (Solution)

Most Stata commands are actually loops: do something to observation one, then do it to observation two and so forth. As Stata works through this loop, it tracks which observation it is working on in an internal variable called _n. You are welcome to use this variable in your commands:

l if _n==5

will only list observation five, because the condition _n==5 is only true when Stata is working with observation five.

_n becomes even more useful when combined with *by. *Suppose you wanted to list the first observation in each household:

by household: l if _n==1

It just so happens that the first observation is the head of household in every case, which is not unusual. But what if instead of having rel2head you only knew the head of household by their location in the household?

First, you'd have to be very careful about sorting. Stata's default sort algorithm is not "stable," meaning that if you sort by household it may change the order of observations within the household. If the order of observations matters, you should add the stable *option* to any sort commands. That way Stata will use a different sort algorithm that is slower but will not change the order of observations within a group.

Another internal variable, _N, contains the number of observations in the data set. It is also the observation number of the last observation. You can use it in commands just like _n:

by household: l if _n==_N

This lists the last observation in each household.

Often you'll want to have a within-group identifier so you can always tell which observation is which, even after a mistaken sort. In this case the within-group identifier could logically be called person:

by household: gen person=_n

The person variable will correspond to the observation number of the person within their household in the current sort order. If you wanted a globally unique identifier, run the above command without by household:.

Like _n, _N honors by groups. Thus _N contains the number of observations in the *by* group currently being worked on. You can easily find household size with:

by household: gen size=_N

Consider the command:

gen newIncome=income

In carrying it out, Stata looks at one observation at a time, and sets newIncome for that observation equal to income for the same observation. Subscripts allow you to look at the value of a variable for any observation you want. Try:

gen newIncome2=income[1]

income[1] means "the value of income for observation 1." Thus newIncome2 will be 60,000 for all observations (not that that is a useful result).

Consider trying to identify the female-headed households:

by household: gen femaleHead=female[1]

Since the first person in each household is the head, the household has a female head if and only if the first person is female.

What if the head of household were last instead of first? Just change it to:

by household: gen femaleHead=female[_N]

What if the heads of household weren't in any particular place within the household? Make them first:

sort household rel2head

by household: gen femaleHead=female[1]

What if the code for "head of household" weren't the lowest value of rel2head? This following will always work:

gen isHead=(rel2head==1)

sort household isHead

by household: gen femaleHead=female[_N]

What if some households don't have a head, and you need femaleHead to be missing for those households? Do the above, but add an *if* condition to the last line:

by household: gen femaleHead=female[_N] if isHead[_N]

Any time you need to pick out the characteristics of a special row within a group (the respondent to a survey, the month in which a subject graduated, etc.) all you need to do is:

- Create an indicator variable that is one for the special row
- Sort by the group ID and the new indicator variable
- The special row will be last and can be accessed with [_N] as long as you start with
*by*

If you want the special observation to be first rather than last, you can use gsort (generalized sort):

gsort household -isHead

With gsort you can put a minus sign in front of a variable name and the observations will be sorted in descending order by that variable rather than ascending.

The householdIncome variable should have the same value for all the individuals within a given household. You can check that with:

sort household householdIncome

by household: assert householdIncome[1]==householdIncome[_N]

Because the observations within a household are sorted by householdIncome, the smallest value will be first and the largest value will be last. If the first and last values are the same, then you know all the values are the same.

- How could you check that every household has one one only one head of household?
- Create an indicator variable for whether a household's value of age varies. Use it to browse just those households whose age does vary.

Subscripts can contain mathematical expressions, including _n and _N. Consider a data set containing enrollment numbers for ten fictional schools:

use schools, replace

We'll define a student's peer group as everyone in her grade, the grade above her, and the grade below her. To find the size of each grade's peer group, type the following:

by school: gen peerGroup=students+students[_n+1]+students[_n-1]

The result is missing for grade one because it doesn't have a grade before it, and for grade twelve because it doesn't have a grade after it. Thus students[_n-1]or students[_n+1] give missing values for them.

- What would happen if you left out by school: in:
by school: gen peerGroup=students+students[_n+1]+students[_n-1]

What would happen if some schools didn't have an observation for some grades? (Solution)

- Redefine peerGroup to include two grades above and below a student's grade. Furthermore, where that calculation would include grades that don't exist, find peerGroup by adding up the grades that do. (Solution)

Panel data, or data with multiple individuals observed multiple times, can be treated like grouped data even though a "group" in this case is an individual. (This is why we introduce more general terminology in Hierarchical Data.) Consider a data set of five people observed for twenty months, with each person's employment status recorded each month:

use employment, replace

A typical person in this panel is employed for a while, then unemployed for a while, etc. Each period of continuous employment or unemployment is called a "spell" and a common first task with such data is to identify the spells.

Begin by identifying the months which start a new spell, i.e. the months where a person's employment status is different from what it was the previous month:

by person: gen start=(employed!=employed[_n-1])

For the first month in which a person is observed, the quantity employed[_n-1] does not exist and is thus missing. Since employed is never missing (how would you check that?) this guarantees that the first month a person is observed is marked as the start of a new spell.

Next comes something you should add to your bag of tricks:

by person: gen spell=sum(start)

The sum() function finds running sums, i.e. the sum of a variable for all observations up to and including the current observation. Since start is one whenever a spell starts and zero otherwise, sum(start) for an observation is the number of spells which have started up to that point--and that serves as a splendid spell ID.

Once you've identified the spells, you can treat them as groups. However, these spell IDs only make sense within the context of a person (each person has their own spell number one). Thus the proper *by* is by person spell:, and the first time you use it you'll have to say bysort. But everything you've learned still applies. For example, finding the duration of a spell is exactly like finding the size of a household:

bysort person spell: gen duration=_N

Think back to the command:

by person: gen start=(employed!=employed[_n-1])

What would happen if you omitted the

*by*? (Solution)- Create variables containing the start month, start year, end month and end year for each spell. (Solution)
- Find the mean spell length for each person. Make sure the mean is calculated over spells, not months. (Solution)

Next: Hierarchical Data

Previous: Statistics

Last Revised: 2/17/2014