Solution

Two notes before diving into the details:

  1. It's probably worth your time to try to combine these files before fixing all the errors so you can see what the result looks like. When it comes to programming, "experience" often comes down to having made a mistake before so you recognize it faster when you make it again.
  2. Given that you have two files to fix and Stata only works with one file at a time, the efficient way to do it is to load one, fix all the problems in it, save the results, and then move on to the other. But for clarity we'll discuss fixing one kind of error at a time in both files and omit the implied save and use commands. We'll put everything together and in the proper order at the end.

Begin by doing a describe on both data sets:

use error2007
d
use error2008
d

The first problem is the variable names: in error2007 the identifier variable is called studentID while in error2008 it is called id. In error2007 the teacher identifier is called teacher, while in error2008 it is called teacherID. Meanwhile the student's score is called testScore in error2007 and simply score in error2008. It doesn't matter what the variables are called, but they must be consistent. We'll go with id, teacher and score:

rename studentID id
rename testScore score
rename teacherID teacher

Second, in error2007 the id variable (formerly studentID) is a string while in error2008 it is a number. Strings are nice for identifiers because you never have to worry about rounding, so we'll convert the id in error2008 to a string. However, there's an additional complication: the string identifiers in error2007 have leading zeroes for if they are less than ten ("01", "02" etc.). One option would be to tell the string function to use a format that includes leading zeroes when converting the numbers to strings. But rather than look up the proper format I'd suggest a straightforward replace:

gen temp=string(id)
replace temp="0"+temp if id<10
drop id
rename temp id

The third problem is duplicate observations. You can examine them with:

duplicates report id
bysort id: gen copies=_N
l if copies>1

This reveals that in error2007 you have two student 41's, and they're different. There's just one student 41 in error2008 and you have no idea which it should match with, so you'll have to drop both student 41's from error2007. You could drop student 41 from error2008 as well, but you'll take care of that when you do the final merge by only keeping observations that match.

drop if id=="41"

(Remember we converted id to a string, so the number 41 must go in quotes.)

Meanwhile in error2008 we have two student 37's and two observations with missing values for id. error2007 has no observations with missing values for id, so they won't match with anything and will be deleted automatically. The two student 37's on the other hand, turn out to be the same person--their data are identical. Thus all you need to do is drop one:

duplicates drop id, force

Now we've fixed the errors, but we still have to keep track of which year is which. To put things in the wide form that means adding the year to the end of all the level one variable names:

rename score score2007
rename race race2007
rename teacher teacher2007
rename teacherRace teacherRace2007

And similar for 2008.

Now you're finally ready to merge. For now keep everything whether it matches or not:

merge 1:1 id using fixed2007

You'll see that despite your best efforts, some observations did not match. Examine the problem observations with:

browse if _merge!=3

The observations with the id's "41" and "." were dealt with previously and you expected them not to match. However, "40", "20", "55" and "60" are unexpected. But they do not indicate mistakes: if you go back to the original files, you'll see that error2008 has no student 40, and error2007 has no students 20,55 or 60. Thus you've done the best that can be done.

You can eliminate the unmatched observations (assuming they're no use to you without both years' data) with:

drop if _merge!=3

Alternatively you can go back and chance the merge command to:

merge 1:1 id using fixed2007, keep(match)

But note that if you did that from the beginning you couldn't check your results.

Following is the complete code, in the proper order and including all the use and save commands:

use error2007
d
rename studentID id
rename testScore score

duplicates report id
bysort id: gen copies=_N
l if copies>1
drop if id=="41"

rename score score2007
rename race race2007
rename teacher teacher2007
rename teacherRace teacherRace2007
save fixed2007, replace

use error2008
d
rename teacherID teacher

gen temp=string(id)
replace temp="0"+temp if id<10
drop id
rename temp id

duplicates report id
bysort id: gen copies=_N
l if copies>1
duplicates drop id, force

rename score score2008
rename race race2008
rename teacher teacher2008
rename teacherRace teacherRace2008
save fixed2008, replace

merge 1:1 id using fixed2007
browse if _merge!=3
drop if _merge!=3

Moving on to appending the data, you'll find that Stata will append the two data sets as-is without any complaints:

use error2007
append using error2008

The result is unusable, but browse and see why. The merge command is much more likely to crash than the append command, but that's a good thing--it brings problems to your attention that would have caused trouble eventually anyway.

You'll need to do a bit of fixing just to see the more interesting errors with append:

use error2007
rename studentID id
save fixed2007b
use error2008
gen temp=string(id)
replace temp="0"+temp if id<10
drop id
rename temp id
append using fixed2007b

To identify the duplicate observations you can follow the same procedure as before, but note that each ID should have two observations:

duplicates report id
bysort id: gen copies=_N
browse if copies!=2

You'll see the same suspects as before. Alternatively, if you had created a year variable (as you'd need to if you actually wanted to use this data) you could have included it in your duplicates report and bysort commands and then the expected result would be one observation per id/year combination.

Last Revised: 2/5/2010