Data Wrangling in Stata: Hierarchical Data

This is part five of Data Wrangling in Stata.

Many data sets involve some sort of hierarchical structure. The American Community Survey is an example of one of the most common hierarchical data structures: individuals grouped into households. Another common hierarchical data structure is panel or longitudinal data and repeated measures, where things are observed multiple times. These structures may seem very different, but the same concepts apply to both and often even the same code.

Describing Hierarchical Data

Hierarchical data can be be described in terms of levels (note that these are not the same as the levels of a categorical variable). A level one unit is the smallest unit in the data set. A level two unit is then a group of level one units. Some examples:

Example Data Structure Level One Unit Level Two Unit
Individuals living in households Individuals Households
States grouped into regions States Regions
Students in classes Students Classes
Tests taken by students Tests Students
Monthly observations of individuals Monthly observations Individuals
Visits to the doctor by patients Doctor visits Patients
Social media posts by individuals Posts Individuals

If the hierarchy has more than two levels, simply keep counting up: if you have students grouped into classes grouped into schools grouped into districts, students are level one, classes are level two, schools are level three, and districts are level four.

Each variable is associated with a specific level. A variable is a level two variable if all the level one units within a given level two unit have the same value for the variable. For data structures where a level two unit is observed over time, level two variables are variables that do not change over time.

Example Data Structure Level One Variables Level Two Variables
Individuals living in households Age, Sex Household income, Address
States grouped into regions State population Regional income per capita
Students in classes Student's race Teacher's race, Class size
Tests taken by students Test Score Free lunch eligibility
Monthly observations of individuals Employment Status Highest degree earned
Visits to the doctor by patients BMI, Diagnosis Race
Social media posts by individuals Length, Topic Sex

Of course all of these depend on the details of the actual data set. In a study that observes individuals for a few months, it's unlikely that their highest degree earned will change. But it might! And if it does, highest degree earned becomes a level one variable. It does not meet the definition of a level two variable because different level one units (monthly observations) have different values for it.

With a hierarchical data set, an observation (row) can represent either a level one unit or a level two unit. Consider observing two people for three months:

personID month yearsEdu employed
1 1 16 1
1 2 16 1
1 3 16 1
2 1 12 0
2 2 12 1
2 3 12 1

Exercise: Identify the level one units, level two units, level one variable(s), and level two variable(s) in this data set.

In this form, each observation represents a month (or more precisely, a person-month combination). Now consider the exact same data in a different form:

personID yearsEdu employed1 employed2 employed3
1 16 1 1 1
2 12 0 1 1

In this form, each observation represents a person. Because yearsEdu is a level two variable, there's just one value per person and thus one variable (column). However, employed is a level one variable with three (potentially) different values per person. Thus it needs three variables (columns).

We call the first form the long form (or occasionally the tall form) and the second the wide form. The long form is longer because it has more rows; the wide form is wider because it has more columns. In most cases the long form is easier to work with, so we'll do most of our examples in this form.

Now consider the identifiers in this data set. In the long form, personID and month were a compound identifier for the rows in the data set, while the variable names were a simple column identifier. In the wide form, personID is a simple row identifier, but now the variable names for the level one variables are a compound identifier with two parts: the variable name (employed) and the month in which the variable was observed. To identify a specific value in the data set we still need to know the personID, month, and a variable name, but month has been converted from a row identifier to part of a compound column identifier.

Before proceeding, start a do file that loads 2000_acs_cleaned.dta. This is one of the provided example data sets, but if you did everything in First Steps with your Data (including the optional exercises) you can use the data set you created instead.

capture log close
log using level2.log, replace
clear all
use 2000_acs_cleaned

Creating Level Two Variables Based on Summary Statistics

A very common data wrangling task is creating level two variables based on the level one variables in the data set. If the data set is in long form, one critical tool for doing so is the by prefix, so that commands are executed separately for each level two unit. Another is egen, with its aggregate functions that act across observations.

Continuous Variables

We'll start with examples of calculating continuous (or quantitative) level two variables based on continuous level one variables.

For many purposes, the total income of a household is more relevant than individual incomes. You can calculate it with a combination of by and egen's total() function:

by household: egen householdIncome = total(income)

Now calculate the income per person. Since this is the same as the mean income for the household, you might think you can use:

by household: egen incomePerPerson = mean(income)

In fact this does not give the right answer, as you can see by examining household 484 (browse if household==484). The reason is that young children have missing values for income,which prompts egen to completely ignore them in the calculation. Thus for household 484, incomePerPerson is the total income of the household divided by the number of people who have a non-missing value for income, two, when it should be divided by the number of people in the household, four.

Fortunately, these egen functions can act on mathematical expressions, not just single variables. Since mean(x) = total(x)/N = total(x/N), we can instead use:

by household: egen realIncomePerPerson = total(income/_N)

_N is a system variable, or a variable Stata always keeps track of and you can use even though you never created it. It contains the number of observations in the data set, or, if you are using by, the number of observations in the by group. In this case that's the number of people in the household. You might find that useful directly:

by household: gen householdSize = _N

Because Stata numbers observations starting from 1, _N is also the observation number of the last observation. We'll find that useful as well.

Exercise: Create a variable for the mean age of all the individuals in the household.

Subsetting with egen

Sometimes the way egen ignores missing values can be useful. Consider trying to calculate the total income earned by the children of a household. You might think you could run:

by household: egen householdChildIncome = total(income) if age<18

Examine household 8787. As you see, householdChildIncome has the right number, but it's not a proper household-level variable in that the parent has a missing value instead. For many purposes that would make it unusable. The reason is that egen used the condition if age<18 for two different purposes, one that you want and one that you don't. The first is that in calculating the total it only includes observations where age<18, which is exactly what you want. The second is that it only stores the result in observations where age<18, leaving the other observations with missing values. This is not what you want.

An easy way to remedy this is to create a new variable which is 1 for observations you want to include in the calculation and missing for observations you do not:

gen toUse=1 if age<18

Next, have egen act on the variable of interest times the variable you just created:

by household: egen realhouseholdChildIncome = total(income*toUse)

For the observations you want to use, multiplying the variable of interest by 1 changes nothing. For the observations you don't want to use, multiplying by missing gives a missing value egen will ignore. This allows you to have egen do its calculations over a subset of the data without using an if condition, so it will store the result in all observations.

Exercise: Create a level two variable called meanAdultAge containing the mean age of the adults in the household. Make sure it has the same value for all the members of the household, including any children.

Indicator Variables

When an indicator variable is coded one/zero, the egen functions used above take on new and useful meanings. For example, the total of an indicator variable is the number of observations for which the indicator is one. Create a variable containing the number of children in each household with:

gen child=(age<18)
by household: egen numChildren = total(child)

The mean of an indicator variable is the proportion of observations for which the indicator is one. Create a variable contain the proportion of each household that is below 18 with:

by household: egen propChildren = mean(child)

Next consider the maximum value of an indicator variable:

by household: egen hasChildren = max(child)

If a household has no children in it, then child is always zero and the maximum value is zero. If a household has any children in it, then those children have a one for child and the maximum is one. Thus hasChildren is a household-level indicator variable for "this household has children in it."

More generally, applying the max() function to an indicator variable creates a new indicator variable which is one for all observations (or all observations within a by group) if the original indicator variable is one for any observation.

You can use min() in the same way: the result will be one if the indicator you apply it to is one for all observations, but the result will be zero if any observation has a zero. Use that to create an indicator variable for "all the people in this household are children":

by household: egen allChildren = min(child)

All of these functions can act on an expression rather than a variable, including conditions. Thus we could have used total(age<18), mean(age<18), max(age<18), etc. rather than creating the indicator variable child and gotten the same results.

This data set has no missing values for age, but it's worth thinking about what this code would do with them. The condition (age<18) returns a zero for observations where age is missing (recall that you can think of missing as infinity), as if the person were known to be an adult. Thus in the presence of missing values numChildren isn't really the number of children in the household, it's the number of people known to be children. But that may be exactly what you want.

Exercise: create variables containing the number of college graduates in each household (Associate's degree or above), the proportion of the household members which are college graduates, and an indicator variable for "this household contains at least one college graduate." Then create an indicator variable for "all the adults in this household are college graduates." Remember that edu has missing values, but since the people with missing values of edu are all less than three years old you can safely assume they're not college graduates.

Creating Level Two Variables in Wide Form

Start a new do file that loads 2000_acs_wide.dta:

capture log close
log using level2_wide.log, replace
clear all
use 2000_acs_wide

This is the exact same data set as before, but in wide form. Now there is just one row for each household, but it contains all the data about all the individuals in it. The income1 variable contains the income of the first person in the household, income2 the second, etc.

Now consider adding up the total income of the household. In wide form, instead of using the total() function we need the rowtotal() function. It adds things up just like total(), but while total() adds up the values of a single variable across multiple observations, rowtotal() adds up the values of multiple variables within a single observation.

However, the input rowtotal() needs is quite different. Rather than acting on a single mathematical expression, it acts on a list of variables, or varlist. When a Stata command or function takes a varlist this means both that it needs a list of variables and that it will understand certain shortcuts for specifying that list. In this case we want to act on all the income variables, but there are sixteen of them (one household has sixteen people in it) and typing them all out would be tiresome. So we'll take a brief digression into shortcuts for specifying lists of variables.

Shortcuts for Variable Lists

The most common shortcut is to use the asterisk (*) as a wildcard character. Try:

describe income*

This tells the describe command to act on all variables that match the pattern "income followed by anything." Note that "anything" can include nothing, so a variable just called "income" would be included as well. The wildcard can go anywhere:

describe *1

This matches all the variables with information about the first individual in the household, but also the variables with information about the eleventh individual. Be careful your wildcards don't match more than what you want!

A question mark (?) is also a wildcard, but it matches exactly one character:

describe income?

This matches income1 through income9, but not income10 because it is income followed by two characters.

Another shortcut is to put a dash between two variables. This will give you all the variables in between them:

describe age1-hispanic1

This gives you just the variables with information about the first individual. The order used in resolving this shortcut is the order the variables are listed in the variables window or a describe command. You can use the order command to put the variables in a convenient order.

A varlist can mix multiple kinds of shortcuts as well as individual variable names:

describe household age1-hispanic1 income*

Using Row Functions

Many of the tasks we carried out in long form can easily be done in wide form, with three changes:

  • Remove the by prefix. In wide form a level 2 group is a single observation, not a group of them.
  • Replace the egen function with its row equivalent.
  • Replace the variable to act on with a variable list matching all the corresponding level one variables.

Thus the long form command:

by household: egen householdIncome = total(income)

becomes in wide form:

egen householdIncome = rowtotal(income*)

However, row functions cannot take expressions as arguments, like (age<18) or (income*toUse). You can always take these expressions and turn them into variables, just like we ran gen child=age<18 before. However, in wide form this has to be done for each individual (gen child1=age1<18, gen child2=age2<18, etc.). We discuss how to do this efficiently in Stata Programming Essentials, but in most cases it's easier to work in long form.

Exercise: Create indicator variables for each individual, indicating whether they are black or not. You'll need 16 of them, so use copy and paste. Make sure your indicator variable is missing if race as missing (i.e. if that person doesn't actually exist).Then create two household-level indicator variables: one for "At least one individual in this household is black" and one for "All of the individuals in this household are black." Why don't you need to do anything special with the egen functions to handle missing values?

Panel Data

Panel data, or longitudinal data, are data where subjects are observed repeatedly over time and the timing is important. If timing isn't important then we call it repeated measures data. The National Longitudinal Survey of Youth is an example of panel data, and we'll use it a small extract from it as an example. (Note that this extract combines income variables from different years with slightly different definitions into a single income variable, so you really wouldn't want to use this extract for actual research.) Create a do file called panel.do that loads it:

capture log close
log using panel.txt, replace
clear all
use nlsy_extract

Exercise: Apply what you learned in "First Steps With Your Data" to this data set. In particular, identify the primary keys and the data structure that implies, and figure out the nature of the edu variable. What does it suggest about the data collection process that income and edu are frequently missing for the same observation? What does it tell you about age that it is never missing? What is a level one unit in this data set? What is a level two unit? Which variables are level one variables? Which are level two variables?

Most of the techniques we learned for working with individuals in household carry over directly to panel data. For example, to find the total income earned during the study period, run:

by id: egen totalIncome = total(income)

To find the age of the subject the first time they appear in the study, run:

by id: egen startingAge = min(age)

But what if you wanted to know their income the first time they appear in the study? Recall that income[1] means "the value of income for the first observation." If combined with by it means "the value of income for the first observation in the by group." This is highly convenient, but you need to make sure that the observations for each subject are in chronological order so their first observation really is the first time they appear in the study:

sort id year
by id: gen startingIncome = income[1]

You need to be careful because Stata's default sorting algorithm is not stable. This means it will put ties in whatever order will make it run fastest. So if you run sort id, or bysort id:, the observations for each person could be in any order. In practice, if the data are already sorted or mostly sorted the order that will make the sort run fastest is usually to leave things alone. But you can't count on that. So if you're going to run code that depends on the sort order, be sure the data are actually in the right order.

Exercise: Create endingIncome, the subject's income the last time they appear in the study. Recall that _N is the observation number of the last observation.

Calculations Based on Neighboring Observations

Sometimes you need to carry out calculations that take into account not just the current observation, but neighboring observations.

The edu variable is missing for years where the subject was not interviewed. In many cases the subject's level of education is the same before and after the gap and it would be safe to fill in those values.

We'll start by filling them in "forwards", meaning that value of edu before the gap is carried forward to fill in the missing values. Make a copy of the variable so we can compare the new version with the original:

gen eduForward = edu

Then fill in the gaps with the following:

by id: replace eduForward = eduForward[_n-1] if eduForward==.

The system variable _n is the observation number of the current observation. Thus eduForward[_n-1] means "the value of eduForward for the observation before the current observation."

The alternative is to fill edu in "backwards", meaning that the value of edu after the gap is carried backwards. You might think you can do this with the same command, just replacing [_n-1] with [_n+1]. However, that won't work because of the order in which Stata carries out a replace command.

In carrying out a replace command, Stata updates the observations in order starting from the first observation. Imagine a hypothetical subject who has observed values for edu in year one and year four, but not years two and three. When filling in "forwards", Stata first sets edu for year two to the value of edu for year one, then sets edu for year three to the value of edu for year two, which was carried forward from year one. If you tried to fill in "backwards" in the same way, edu for year two would be set to the value of edu for year three, which is missing. Then edu for year three be set to the value of edu for year four, but at that point it's too late to fill in the value for year two.

This does not mean it's impossible to fill in "backwards." It just means the sort order of the data needs to match the order in which we need the observations to be processed by replace:

gsort id -year
gen eduBackward=edu
by id: replace eduBackward = eduBackward[_n-1] if eduBackward==.
sort id year

The gsort ("generalized sort") command will sort observations in descending order rather than ascending order if you put a minus sign in front of the variable name. Thus this code puts the observations for each subject in reverse chronological order, fills in the eduBackward variable, and then puts them back in chronological order.

Run browse id year edu* to see the results. In most cases eduForward and eduBackward are the same, but subject number 1 is an exception: because subject 1 was never observed after 1981, eduForward fills in 12th grade for all the remaining years, while eduBackward still has missing values. Subject number 23 illustrates a different problem: they reported 3 years of college in 1983, then were lost until 1991 when they reported 6 years of college. Some time during the seven years in between they attended three years of college. but we don't know when and thus don't know their value of edu for any of those years. The safe thing is to only use filled in values when edu is the same before and after the gap, and thus eduForward is the same as eduBackward:

replace edu = eduForward if eduForward==eduBackward

Exercise: create an indicator for "the subject attended school this year." You'll know they attended school if their value of edu is higher than the year before. The variable should be missing if edu is missing for the current year or the year before.

Most likely your solution to the exercise above included if edu<. & edu[_n-1]<. This will also lead to the indicator variable being missing for the first observation for each subject (1979). This makes sense, but how did Stata know to do it?

When _n is 1 (the first observation for each subject) there is no observation _n-1. Many languages would give an error message like "index out of bounds" at this point. Stata proceeds, but sets edu[_n-1] to missing. This highlights the reason we need by id: for this command. If we did not have it, Stata would try to determine if a subject attended school in 1979 by comparing their education level in 1979 to the previous subject's education level in 2000. Having by id: in front of the command ensures each subject is handled separately.

Events and Event History

Often with panel data you'll need to identify particular events or sequences of events. For example, suppose you need to identify the year in which each subject graduated from high school. A subject graduated from high school in a given year if they have 12 years of education in that year and less than 12 years of education the year before:

by id: gen grad = (edu==12 & edu[_n-1]<12)

If either edu or edu[_n-1] are missing, this code will set grad to zero. So a one for grad technically means "We know the person graduated this year" while a zero means "We don't know that the person graduated this year." For event analysis that's frequently what you need.

When an indicator variable indicates that an event happened, the total of that variable is the number of times the event happened. To check your work, determine how many times each subject graduated from high school:

by id: egen timesGraduated = total(grad)
tab timesGraduated

Many subjects graduated zero times, but this is not surprising: either they really didn't graduate, or they graduated outside the study period, or missing data prevented you from identifying the year in which they graduated. Fortunately, no one graduated more than once. This could happen due to a data entry or reporting error and then you'd have to fix it.

Next create an indicator for "subject took a break from college." We'll identify a year that a subject took a break from college with:

  • They have started college (edu>=13)
  • They have not finished college (edu<16)
  • Their years of education completed is the same as the year before

by id: gen break = (edu>=13 & edu<16 & edu==edu[_n-1])

Now, create an indicator variable to identify people who took a break from college at some point in the study:

by id: egen tookBreak = max(break)

To see the results, run browse id year edu break tookBreak if tookBreak.

Exercise: Our current definition of taking a break from college includes dropping out of college permanently. Create a person-level indicator variable for "this person finished college" (i.e. at some point their edu is 16 or higher). Then modify the above commands so that only people who finish college are counted as taking a break from college.

Suppose you are interested in the effect of taking a break from college on subsequent outcomes, so you need to identify all the years after a subject took a break from college. Do so with:

gen afterBreak = 0
by id: replace afterBreak = (break[_n-1]==1) | (afterBreak[_n-1]==1)

The first term, (break[_n-1]==1), ensures afterBreak will be 1 in the year after the subject takes a break. The second term, (afterBreak[_n-1]==1) ensures afterBreak will be 1 in every year after that.

Why do you need to write out break[_n-1]==1 rather than just break[_n-1]? Remember, the construction if indicatorVariable only works properly if indicatorVariable has no missing values. Even though break has no missing values, break[_n-1] is always missing for each subject's first observation. (It asks for the value of break for the observation before the first observation, which does not exist.) A missing value counts as true, so if you just wrote break[_n-1] then afterBreak would be one for the first observation—and every observation after that.

Now suppose you need to know the number of "break" years the subject has taken, as of the current year. This will be the running sum of the break variable, where a running sum is the sum of all the observations up to and including the current observation. The sum() function calculates running sums, and is very useful any time you need to calculate how many times a subject has experienced an event. You might expect sum() to be an egen function since it acts across observations, but in fact it's a standard function since it only needs to look at prior observations.

by id: gen numBreaks = sum(break)

Exercise: Create an indicator variable that identifies those years that come after a break but before the subject graduates from college, meaning that edu is less than 16.

Level 2 Variables Based on a Special Level One Observation

Sometimes you need to create a level two variable which is just the value of a level one variable for a particular level one unit, such as the subject's age at the time they graduated from high school. Here's one way to do that:

sort id grad
by id: gen ageAtGraduation = age[_N] if grad[_N]

Sorting by id and grad puts the observation in which the subject graduated last, so you can get the subject's age in that year with age[_N]. However, recall that for many subjects we could not identify the year they graduated, so it's important to only set ageAtGraduation to age[_N] if they actually graduated in that year (i.e. grad[_N] is true).

An alternative method uses one of egen's aggregate functions with a subset which is just the observation of interest. Any aggregate function with the property that if you give it a single number it returns that number will do, including mean(), total(), min(), or max().

gen toUse=1 if grad
by id: egen ageAtGraduation2 = mean(age*toUse)

Exercise: Find the subject's age and income at the time of their first break from college.

Next: Restructuring Data Sets

Last Revised: 9/24/2019