Stata for Researchers: Combining Data Sets

This is part seven of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

Combining two data sets is a common data management task, and one that's very easy to carry out. However, it's also very easy to get wrong. Before combining data sets be sure you understand the structure of both data sets and the logic of the way you're combining them. Otherwise you can end up with a data set that you think is ready for analysis, but is really utter nonsense. Stata tries to make sure you've thought through what you're doing, but can't tell you what makes sense and what doesn't.

Stata always works with one data set at a time, so you will always be combining the data set in memory (the master data set) with another data set on disk (called the using data set, for reasons that will be clear when you see the syntax).

Appending Data Sets

When appending, data sets are stacked.

Stata calls it appending when you add the observations from the using data set to the master data set. Appending makes sense when the observations in both data sets represent the same kind of thing, but not the same things. For example, you might append a data set of people from Wisconsin to a data set of people from Illinois. Both data sets should have the same (or close to the same) variables, with the same names. If a variable only appears in one data set, observations from the other data set will be given missing values for that variable.

The syntax is to carry out an append is simple: load the the master data set and then type:

append using dataset

where dataset is the name of the data set you want to append.

Merging Data Sets

When merging, data sets are placed side by side.

Stata calls it merging when observations from the two data sets are combined. There are, in theory, four kinds of merges:

In a one-to-one merge, one observation from the master data set is combined with one observation from the using data set. A one-to-one merge makes sense when the observations in both data sets describe the same things, but have different information about them. For example, you might merge the answers people gave in wave one of a survey with the answers the same people gave in wave two of the survey.

In a one-to-many or many-to-one merge, one observation from one data set is combined with many observations from the other (the difference between one-to-many and many-to-one being whether the master data set has the "many" or the using data set). These merges make sense when you have hierarchical data, and one data set contains information about the level one units while the other contains information about the level two units. For example, you might merge information about households with information about the individuals who live in those households.

In principle there are also many-to-many merges. In practice they are rarely if ever useful. Fortunately Stata will no longer let you do one by mistake.

In all the merges we'll discuss, Stata combines observations that have the same value of a key variable or variables, typically an ID. In a one-to-many or many-to-one merge, it is the identifier for the level two units that is the key variable (e.g. household ID, not individual ID). It's very important that the key variable have the same format in both data sets.

If an observation in one data set does not match with an observation in the other, it will be given missing values for the variables from the other data set. Since the viability of a research project often depends on how many observations actually merge (e.g. how many people from wave one of the survey could be found in wave two?) Stata gives you tools for figuring out how many observations actually merged and for examining those that didn't.

If a variable exists in both data sets, the values from the master data set will be kept and the values from the using data set will be discarded. Occasionally this is what you want, but it's more likely to be an error. In general you should set up your data such that the only variables the files to be merged have in common are the key variables.

The syntax for a merge is:

merge type keyvars using dataset

The type must be 1:1 (one-to-one), 1:m (one-to many), m:1 (many-to-one) or m:m (many to many); keyvars is the key variable or variables; and dataset is the name of the data set you want to merge. Stata can figure out what type of merge you're doing by looking at the data sets and key variables, but as of Stata 11 you must specify what kind you think you're doing so Stata can stop you if you're wrong.

An Example of Combining Data Sets

The examples include several files containing fictional student information from 2007. scores.dta contains the students' scores on a standardized test, demographics.dta contains demographic information about them, and teachers.dta contains information on their teachers. Take a moment to look at each file, then load the test scores:

use scores, replace

In this data set, each observation represents a student. browse and you'll see that you have a student ID (id), a teacher ID (teacher) and a score for each.

Your first task is to add in the demographic information. In demographics.dta each observation also represents a student, with the variables being id and race. Thus this is a job for a one-to-one merge and the key variable is id.

merge 1:1 id using demographics

Stata will report that all 60 observations matched. It will also create a variable called _merge. A one in _merge means an observation only came from the master data set; a two means it only came from the using data set; and a three means an observation successfully matched and thus came from both. In this case we see that all observations matched and thus have _merge equal to three, so there's no need to keep the variable. In fact we need to drop it (or rename it) before doing any further merges:

drop _merge

Next add information about teachers. In teachers.dta each observation represents a teacher, and each teacher has many students. That makes this a many-to-one merge (since the many students are currently in memory and the one teacher is in the using data set). The key variable is not id, since that refers to the students, but teacher:

merge m:1 teacher using teachers

Again, all 60 observations merged properly, so you can drop _merge.

drop _merge

Combining Panel Data

Now suppose you were tracking these students for multiple years. The data set panel2007.dta contains a simplified version of this data set: just id and score. The data set panel2008.dta has the same variables for a different year. How would you combine them?

The proper way to combine them depends on what data structure you want. This is hierarchical data where a level two unit is a student and a level one unit is a student's data for a particular year. Thus it can be represented in wide form (one observation per student), or in long form (one observation per student per year).

To put the data in long form simply stack the two data sets using append. However, you'll need to know which year an observation represents. To do that, add a year variable to each data set, with the value 2007 for the 2007 data and the value 2008 for the 2008 data. You can do so with the following code:

clear
use panel2007
gen year=2007
save panel2007_append
use panel2008
gen year=2008
append using panel2007_append

To put the data in wide form, do a one-to-one merge with id as the key variable. But first you need to change the variable names. Recall that in wide form, it is the variable names that tell you which level one unit you're talking about. So instead of score, you'll need score2007 and score2008.

clear
use panel2007
ren score score2007
save panel2007_merge
use panel2008
ren score score2008
merge 1:1 id using panel2007_merge

This time you'll see that one observation does not match. You can see which one by typing:

l if _merge==2

Student number 55 was not in panel2008 and thus couldn't be matched. As a result we have no idea what his or her test score was in 2008. Unfortunately this is very common.

If your entire research agenda depends on having both test scores, you may need to drop observations that don't exist in both data sets. You can do so at this point by typing:

drop if _merge!=3

You can also specify which observations should be kept directly in the merge command:

merge 1:1 id using panel2007_merge, keep(match)

keep(match) means only keep observations which match. The alternatives are master and using, and you can list more than one. For example, to keep observations which match and observations that only come from the master data set, while throwing away observations that only come from the using data set, you'd say keep(master match).

Common Problems with Merges

Merges will uncover all sorts of problems with your data set (and if they're not fixed merging will introduce new ones). Here are a few common ones:

Key Variables Stored in Different Formats

While Stata will happily match different kinds of numbers (ints and floats, for example) it can't match numbers and strings. IDs can be stored as either (as long as you choose a numeric type that has enough precision--see Working with Data) and it's not uncommon to find that your data sets store the ID in different ways. In that case it's usually best to convert the numbers to strings:

gen idString=string(id)
drop id
rename idString id

The string() function takes a number and converts it to a string. You can give it a second argument containing the format in which the number should be "written" if needed.

Duplicate IDs

Duplicate IDs will turn what should be a one-to-one merge into some other kind--quite likely one that doesn't make sense.

One possibility is that you simply misunderstood the data sets. If you think you're merging household data and it turns out that one file actually contains individuals, then duplicate household IDs in that file do not indicate a problem. Just be glad the error message brought the true structure of the file to your attention.

Another source of duplicates is round-off error due to saving the IDs in an inappropriate variable type.

But the most common reason for duplicate IDs is errors in the data. These will have to be resolved in some way before merging. You can see how many problems you have with the duplicates command:

duplicates report id

This will tell you how many observations have the same value of id. For further examination, you can create a variable that tells you how many copies you have:

bysort id: gen copies=_N

Then you can look at just the problem observations with:

browse if copies>1

If it turns out that in the copies the entire observation is duplicated (e.g. people are in the data set more than once for some reason) you can delete the extra observations with:

duplicates drop id, force

(the force option reminds you you're about to change your data). However, if the observations are different (e.g. you've got different people with the same ID) and if merging is vital to your research agenda, you may need to drop all observations with duplicate IDs. (If file one has two person ones, which should be merged with person one in file two?) You can do so with:

drop if copies>1

For purposes of merging, missing values are treated just like any other value. If you've got observations with missing IDs you'll probably have to drop them.

Exercises

  1. An alternative way to combine data sets into wide form is to first append them and then reshape. Combine panel2007 and panel2008 into wide form using this method. (Solution)
  2. If you combine the two years of data into long form using append, how can you find the observation which doesn't appear in 2008? (Solution)
  3. Combine error2007.dta and error2008.dta by merging it into the wide form, fixing whatever problems you find. Then append the two data sets and try to locate the same problems. (Solution)

Next: Graphics

Previous: Hierarchical Data

Last Revised: 8/22/2011