Data Wrangling in Stata: Restructuring Data Sets

This is part six of Data Wrangling in Stata.

In this section we'll learn how to restructure data sets. We'll learn how reshape data so we can switch at will between the long form (one row per level one unit) and wide form (one row per level two unit). We'll also learn how to turn a data set containing both level one and level two units into a data set containing just level two units.

Reshape

The reshape command is very simple to use, if you understand the structure of the your data set. In particular, you need to know the level one and level two identifiers, and which variables are level one variables and which are level two.

Start a do file that loads the cleaned version of the 2000 American Community Survey we worked with earlier:

capture log close
log using restructure.log, replace
clear all
use 2000_acs_cleaned

Recall that this is hierarchical data consisting of people living in households, so a person is a level one unit and a household is a level two unit. The level one and level two identifier variables are person and household, and you can confirm that they uniquely identify observations with:

duplicates report household person

The variables age, race, maritalStatus, edu, income, female, and hispanic all describe individual persons, making them level one variables. At this point we don't have any variables that describe the household, so let's make one before proceeding:

by household: egen householdIncome = total(income)

The variable householdIncome is a level two, or household-level variable, because any persons who live in the same household will have the same value of householdIncome.

The data set has one observation per person, or level one unit, so the data set is currently in long form. In wide form it would have one observation per household, and reshape can do that for you:

reshape wide age race maritalStatus edu income female hispanic, ///
i(household) j(person)

The syntax of the reshape command begins by specifying the form you want, in this case wide. Then you give a list of all the level one variables. This is a bit different from the usual syntax where the list of variables tells the command which variables to act on. The reshape command always reshapes the entire data set, but to do so it needs to understand which variables are level one and which are level two. The variables you list are level one variables; any variables you do not list are assumed to be level two variables. (Make sure this is true!) Finally, the command needs to know the identifiers. The i() option specifies the level two identifier, which could be a compound identifier; the j() option specifies the level one identifier. It calls them i and j rather than level one and level two because reshape can be used on data with more than two levels of hierarchy.

Take a moment to look at the data browser. As promised, there is now just one observation per household, and the household variable is now a unique identifier all by itself. Most of the variable names now have two parts: the name of the quantity described (e.g. income) followed by the number of the person being described. The person variable has been changed from part of a compound row identifier to part of a compound column identifier.

Note that there are 16 of each of the level one variables. The largest household had 16 people in it, so storing the information for all of them required 16 of each variable. Since the data set has to be rectangular, that means all the households have 16 of each level one variable, with most of them containing missing values. The householdIncome variable (on the far right in the data browser) remains a single variable because it is a household level variable.

Reshaping from wide form to long form requires the exact same command, just replacing wide with long:

reshape long age race maritalStatus edu income female hispanic, ///
i(household) j(person)

However, the meaning is quite different: age, race, etc. do not refer to individual variables, but to groups of variables, and j(person) does not refer to an existing variable at all. The reshape long command will identify all the variables that start with age, race, etc. then take what follows and store it in a new variable called person.

With reshape wide, the list of level one variables is a list of actual, existing variables, so you can use shortcuts like age-hispanic. With reshape long you are giving a list of "stubs" of variable names, so you must list them all individually (individual stubs, that is, not individual variables). We recommend dropping any variables you won't actually use very early in the data wrangling process, but you definitely want to get rid of them before using reshape long.

However, if you look in the data browser there's a lot more missing data than there used to be. You can see the problem by running:

duplicates report household

Every household now has exactly 16 observations in it. That's because in wide form every household had 16 of each level one variable. This will happen any time you reshape a data set from wide to long: every level two unit will end up with same number of level one units as the largest level two unit. So now you have a bunch of observations that do not actually represent people. Fortunately you can easily detect and drop the extraneous level one units because they have missing values for all the level one variables.

drop if age==. & race==. & maritalStatus==. & ///
edu==. & income==. & female==. & hispanic==.

Keep in mind that real people can have missing data, so you would not want to drop observations that have a missing value for one level one variable or even a few of them. Include all the level one variables in your if condition.

Exercise: Start a new do file, and load the data set nlsy_extract. Identify the level one and level two units, and the level one and level two variables. Run duplicates report id. Reshape the data set to wide form, and then reshape it again to long. Run duplicates report id again. Why don't you need to worry about extraneous observations in this case?

Collapse

Sometimes you want to get rid of the level one units in your data set entirely so you're left with a data set of level two units. For the ACS that would be a data set of households, with no individual-level variables.

If the data is already in wide form this is just a matter of dropping the level one variables.

If the data is in long form, like our ACS sample, begin by dropping the level one identifier and all of the level one variables:

drop person age race maritalStatus edu income female hispanic

Next, keep just one observation for each level two unit. All the observations for a given level two unit are identical so it doesn't matter which one you keep, but the first one is convenient:

by household: keep if _n==1

Often, however, you need to create new level two variables based on the level one units before you can eliminate the level one units entirely. In this example we'll create variables for "number of people in the household" and "proportion of the household that is female" as well as keeping the household income variable we created previously. You can always do that by creating the level two variables using the methods described in the previous section and then dropping the level one units using the method just described. But if all the variables you need to create are summary statistics, the collapse command can do the entire process for you quickly and easily.

The collapse command takes all the observations (level one units) for a given level two unit and aggregates them into a single observation. Thus it needs to know the level two identifier, which variables you want aggregated, and how you want to aggregate them. Most of the aggregation rules are based on summary statistics.

Since we're going to convert the data set to a data set of level two units again but in a different way, comment out the previous commands that did it:

//drop person age race maritalStatus edu income female hispanic
//by household: keep if _n==1

Instead, run the following:

collapse ///
(first) householdIncome ///
(mean) proportionFemale=female ///
(count) householdSize=person, ///
by(household)

Now consider each element of the command in turn:

(first) means the variables that follow should be aggregated using the rule "keep the first value." The householdIncome variable is already a level two variable, so all the values for a given level two unit are the same and we just need to keep the first one.

(mean) means the variables that follow should be aggregated using the rule "take the mean." The female variable is binary, taking its mean tells us the proportion of household members that are female. However, we don't want to call the result female, so we rename it to proportionFemale. It's a bit backwards, but (mean) proportionFemale=female can be read "take the mean of the female variable and call the result proportionFemale."

(count) means the variables that follow should be aggregated using the rule "count the number of non-missing values." What we really want is the number of observations, but for any variable with no missing values that will be the same thing. If we had to we could create such a variable (gen i=1) but we know that person has no missing values (identifiers rarely do) so we can use it. Again, we want to give the result a new name, and the syntax (count) householdSize=person can be read "count the number of non-missing values of person and call the result householdSize."

You can list many variables after each aggregation rule, not just one. You can also skip specifying an aggregation rule, in which case Stata will assume you want means.

The by(household) option tells collapse the level two identifier so it knows which observations to aggregate.

Note that after running this command, the only variables left in the data set are those mentioned in it.

Exercise: Go back to nlsy_extract and use collapse to convert it into a data set with one row per person. The resulting data set should contain the person's year of birth, their mean income over the study period, and their maximum educational attainment. (Bonus question for economists: what's the problem with simply taking the mean of income over the study period? We'll fix it in the next section.)

Next: Combining Data Sets

Last Revised: 9/26/2019