Stata for Researchers: Hierarchical Data

This is part seven of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

Hierarchical data are data where observations fall into groups or clusters. The most common examples at the SSCC are individuals living in a household and a subject being observed multiple times, as in the data sets used in the previous section. But there are many other applications: schools within a district, courses taken by a student, or individuals who are part of a subject's social network can all be treated as hierarchical data. Hierarchies can have more than two levels; for example students may be grouped into classrooms which are grouped into schools which are grouped into districts.

This article will introduce you to a way of describing hierarchical data that will make it easier to talk about—and think about—its structure. It was developed by the community of statisticians interested in formal Hierarchical Linear Modeling, but you don't have to be doing HLM for the concepts and terminology to be useful. We'll then introduce two commands that act directly on the structure of hierarchical data sets: reshape and collapse.

Describing Hierarchical Data

Since hierarchical data can describe so many different things, we need to define terms that can apply to all of them. We'll describe the smallest unit in the data as the level one unit. In the examples mentioned above the level one unit would be an individual within the household, a particular time the subject was observed, a school within the district, a course taken by the student, or an individual within the subject's social network. A level two unit is a group of level one units: the household in which the individuals live, the subject which is measured repeatedly, the district which contains the schools, the student who takes the courses or the subject whose social network is being described. If needed, a level three unit is a collection of level two units, and so forth.

Most hierarchical data sets will include some variables which describe the level one units and some which describe the level two units. For example, a data set of individuals living in households may contain the age and sex of each individual, plus the household income of the household as a whole. Age and sex would then be level one variables while household income would be a level two variable.

Level two variables are easy to identify: they always have the same value for all the level one units in the same level two unit. For example, the total number of people in the household must be the same for every member of a household, or if a subject is observed multiple times he or she will have the same birth date each time. Anything that varies within a level two unit is a level one variable: individuals within a household can obviously have different employment statuses, so employment status must be a level one variable. While it's rarely difficult to identify which of your variables are level one and which are level two, taking a moment to do so during the planning stage of your project and before writing any code can help you avoid a lot of headaches.

Ideally each level will have an associated identifier: for example a household ID and an individual ID, or a subject ID and a survey wave ID. The level one identifiers only need to be unique within a level two group, and in fact the reshape command will only work if this is the case. Often data sets do not come with a usable level one identifier, but you learned how to create one in the last section (e.g. by household: gen person=_n).

Representing Hierarchical Data as a Matrix

Stata (like most statistical programs) stores its data in a matrix, where rows are observations and columns are variables. But when working with hierarchical data "observation" is an ambiguous term: it could mean either a level one unit or a level two unit. The purpose of the reshape command is to allow you to go back and forth between the two definitions at will, restructuring your data accordingly.

If an observation represents a level one unit then your data are in the long form, so named because it has more observations. In the long form, both level one and level two variables are represented by columns in the data matrix. However, level two variables will have many repeated values, since all the observations in the same level two group will share the same values of all the level two variables.

If, on the other hand, an observation represents a level two unit then your data are in the wide form, so named because it has more variables. In wide form, level two variables are represented by columns as usual. However, level one variables are represented by sets of columns, with each set containing a column for each level one unit. Thus the values of the level one variables for each level one unit within a level two unit are stored in the same row, but in different columns.

Consider the following data:

person wave birthdate education income
1 1 12/2/1963 HS Grad 60000
1 2 12/2/1963 HS Grad 65000
2 1 3/18/1966 Bachelor's 90000
2 2 3/18/1966 Bachelor's 0
3 1 6/6/1959 Some College 40000
3 2 6/6/1959 Bachelor's 110000

This is panel data, with three people each observed in two survey waves. Thus a level two unit is a person, while a level one unit is a person-wave combination. A person's birthdate never changes, so it is a level two variable. Meanwhile, a person's income clearly can change between waves, so it is a level one variable. Education looks like it might be a level two variable at first, but person number three apparently went back to school between waves and finished a Bachelor's degree. Thus education must be treated as a level one variable.

Since each row represents a level one unit, the above data are in the long form. Now consider the exact same data in a different format:

person birthdate education1 income1 education2 income2
1 12/2/1963 HS Grad 60000 HS Grad 65000
2 3/18/1966 Bachelor's 90000 Bachelor's 0
3 6/6/1959 Some College 40000 Bachelor's 110000

Here a row represents a level two unit, so this is the wide form. The level one variables education and income are represented by two columns each, one for each wave.

Using Reshape

Stata's reshape command allows you switch between the two forms at will. The general syntax is:

reshape long/wide "stubs" of level 1 vars, i(level 2 ID) j(level 1 ID)

Before talking through the syntax in detail let's do an example. Start a do file that loads the dataset reshape1.

Do a list and you'll see it's currently in the long form (as in this table). To change that, type:

reshape wide education income, i(person) j(wave)

Do another list to see the results (it should look like this table except for the placement of the birthdate column). To go back to long form, type:

reshape long education income, i(person) j(wave)

Referring back to the general syntax, long or wide is the form in which you want to put the data. Next comes a list of level one variables, but note that when the data set is in wide form it does not contain any variables literally called education or income. Instead you have education1, education2 and so forth. birthdate is not in the list, as it is a level two variable.

The i() option is where you give the level two identifier variable. j() is then the level one identifier—but note again that in wide form the data set does not have a variable called wave. When reshaping from wide to long, education income combined with j(wave) can be interpreted as "look for variable names that start with education or income, then take whatever follows those words and put it in a new variable called wave."

Removing Extraneous Cases

Now consider a (fabricated) data set consisting of individuals living in households:

household income age1 female1 age2 female2 age3 female3
1 30000 30 1 2 1 . .
2 90000 45 0 43 1 15 0

Here the level one unit is an individual, the level two unit is a household (with the household variable as the identifier), income is a level two variable, and age and female are level one variables. Since an observation represents a household, the level two unit, this data set is in wide form.

Start a do file that loads this data set, called reshape2.

If you need a dataset of individuals, all you need to do is reshape the data into the long form:

reshape long age female, i(household) j(person)

Here is the result:

household person income age female
1 1 30000 30 1
1 2 30000 2 1
1 3 30000 . .
2 1 90000 45 0
2 2 90000 43 1
2 3 90000 15 0

There's just one problem: person number three in household one. Note that in wide form the number of columns is determined by the household with the largest number of members: the data set must have age and female columns for all the individuals in that household. However, that means the smaller households must have those columns as well. In this data set, household one only has two people so age3 and female3 are missing. But the reshape command doesn't choose not to create an observation just because these variables have missing values. Conceivably there could be a third person in household one and we just don't know their age or sex.

Such agnosticism is rarely an option with real data. In big surveys the largest household tends to be very large indeed, so you could easily have twenty or more age and female variables. Converting them all to observations results in a data set consisting mostly of missing values. Thus you most likely want to drop these extraneous observations:

drop if age==. & female==.

If you had more level one variables you would still use all of them in the if condition, to be sure that you don't eliminate a real person who is only missing a few variables.

Collapse

Sometimes you need to remove the level one units from your data entirely, leaving a data set of level two units. If all the level two variables you need have already been created, you can do so with code like:

drop age sex
by household: keep if _n==1

Don't put this in your do file. But if you did, the first line would drop all the individual level variables, and the second would keep just the first observation in each household. (Once the level one variables are gone, it doesn't matter which one you keep because they're all the same.)

But if you need to calculate some level two variables before dropping the level one units, the collapse command may be able to do both for you. collapse says it "converts the dataset in memory into a dataset of means, sums, medians, etc." Left unsaid is that it does so across level two units.

The basic syntax for collapse is:

collapse (statistic1) varlist1 (statistic2) varlist2..., by(level 2 ID)

The various statistics collapse calculates can be found by typing help collapse, but some particularly useful ones are mean (the default), count, and first. By default collapse will replace the current values of each variable with the statistic it calculates for that variable, but you can have it rename variables with newvar=oldvar. Just note that oldvar will be removed—as will any variables not listed somewhere in the collapse command.

Suppose you want to reduce the data set of individuals you have now to a data set of households, and for each household you need to know the household income (which you already have), the proportion of household members which are female, and the size of the household. You can do so with the following collapse command:

collapse (first) income (mean) propFemale=female (count) size=person, by(household)

Since income is already a level two variable all you need to do is carry it over into the new data set. (first) income does so by setting the income statistic to the first value of income for each household.

(mean) propFemale=female takes advantage of the fact that the mean of an indicator variable is the proportion of observations which have a one for it. The statistic is calculated based on the female variable, but the result is called propFemale.

The (count) statistic counts how many observations have a non-missing value of the listed variable. To get the size of the household you want to count all observations, so you can have it count any variable that's never missing. Identifiers (person in this case) are good candidates because they are rarely missing, but you should always check.

Exercise

  1. Consider the data set reshape4.dta. What is a level one unit? What is a level two unit? What are the level one and level two variables? Reshape it to wide form and then go back to long form. Now convert it to a data set of just countries, with variables meanPop (mean population over the period) and maxCGDP (maximum cgdp over the period). (Solution)

Next: Combining Data Sets

Previous: Working with Groups

Last Revised: 1/4/2016