Stata Programming Techniques
for Panel Data

Panel data, where subjects are observed repeatedly over time, is a very common data structure in the social sciences. This article will teach you some programming techniques used to prepare panel data for analysis. They include:


Panel data is a particular kind of hierarchical data, where the level 2 unit is a subject and the level 1 unit is a subject observed in a particular period. (If you're not familiar with this vocabulary for describing hierarchical data, here's an introduction to it.) Panel data normally includes both variables that change over time (level 1 variables) and variables that do not (level 2 or subject-level variables). It's very important that you know the type of each of your variables.

While this article will describe the problems to be solved in terms of subjects observed over time, some of these techniques apply to any hierarchical data structure. In fact the code for some of these examples is essentially identical to code used in the Working with Groups section of Stata for Researchers to identify characteristics of households composed of individuals.

This article will assume that the data set is in the long form, i.e. there is one observation per person per period. (The kind of work described here is usually much easier in long form.) It will also assume that the data set is sorted by subject and period, so the observations for each subject are in chronological order.

A note on variable names: with this sort of data you could easily have variables for "was incarcerated in this month," "was incarcerated at some point," or "had been incarcerated by this period." Use clear variable names and/or labels to help you distinguish between them.

Identifying an Event

An event could be anything from "was incarcerated" to "was diagnosed with diabetes" to "attended a four-year college." Before doing anything else, you need to identify an if condition which will be true if the event happened in a given time period and false otherwise. For example, if you have an inc variable which is 1 if the subject was incarcerated in a given month and 0 otherwise, the condition is simply inc==1. (If you have no missing data, just inc will do because to Stata 1 is true and 0 is false—just keep in mind that any other value, including missing, is also treated as true.) Other examples might be diag==146 (assuming 146 is the code for diabetes in your data set) or enrolled & type==4.

Did a subject ever experience this event?

Often you need to be able to identify those subjects for whom the event occurred at any point. At this level it does not matter when the event occurred or how many times. The result will be a subject-level binary variable which will be 1 in all time periods if subject ever experienced the event and 0 in all time periods otherwise.

Use max() function in the egen library to identify whether the event condition identified above is ever true:

by subject: egen everInc=max(inc)

by subject: egen everDiabetes=max(diag==146)

by subject: egen ever4yr=max(enrolled & type==4)

Recall that true is 1 and false is 0. Thus if the condition is true (1) for any observation max() will return 1. If it always false (0), max() will return 0.

How many times did the subject experience this event?

If you need to know how many times the subject experienced the event during the study period, use the total() function:

by subject: egen timesInc=total(inc)

by subject: egen timesDiag=total(diag==146)

by subject: egen semesters4yr=total(enrolled & type==4)

How many times has the subject experienced the event thus far?

If you need to know how many times the subject has experienced the event up to and including the current period, use the sum() function. It calculates a running sum over all the observations it has seen thus far (which is why it is a regular function and not part of the egen library).

by subject: gen timesIncThusFar=sum(inc)

by subject: gen timesDiagThusFar=sum(diag==146)

by subject: gen semesters4yrThusFar=sum(enrolled & type==4)

Has a subject experienced this event yet?

Sometimes you need to identify which periods come before the subject first experienced an event and which come after. The result will be a binary variable which changes over time, but only from 0 to 1 (i.e. it will be some number of zeros followed by some number of ones). The easy way to do this is to first create a variable counting how many times the subject has experienced the event thus far using sum() as described above. If that number is greater than zero, the subject has experienced the event.

gen hasBeenInc=(timesIncThusFar>0)

gen hasDiabetes=(timesDiagThusFar>0)

gen has4yr=(semesters4yrThusFar>0)

Events in Sequence

Sometimes the order in which events happen matters. For example, you may be interested in events where a given drug is prescribed before the patient has received a diabetes diagnosis, or in people who attend a two-year college after attending a four-year college. We'll call the first event event A and the second event B.

The first step is to create a variable for "Has a subject experienced event A yet?" as described in the previous section. Then you can take whatever condition identifies event B, add the condition that the subject either has or has not experienced event A yet, and use any of the techniques described above. Two examples:

gen prescribeBefore=(drug==216 & !hasDiabetes)

by subject: egen everPrescribedBefore=max(drug==216 & !hasDiabetes)

After running this code, prescribeBefore is an indicator for "in this period the subject was prescribed drug 216 before having received a diagnosis of diabetes" and everPrescribedBefore is a subject-level indicator for "the subject was prescribed drug 216 before receiving a diagnosis of diabetes at some point."

If you were interested in "after" rather than "before" you'd simply change & !hasDiabetes to & hasDiabetes.

Changing Time Periods

Sometimes the "periods" in panel data represent different amounts of time for different subjects. For example, the time between survey interviews may vary, or semesters start and end at different times at different institutions. In that case it may be convenient to break up the original time periods into smaller standard periods, such as months, especially if the data for each period can be assumed not to change during the period.

Load the following example data set:


This contains (fictional) data about students in college. There is one row per student per semester, and the level variable tells us whether they attended a two-year or four-year institution in that semester. The variables startYear, startMonth, endYear, and endMonth tell us when each semester began and ended. Our goal is to restructure the data such that we have one observation per student per month.

Begin by creating a semester identifier, numbering them in chronological order:

sort id startYear startMonth
by id: gen semester=_n

Next convert the date variables into Stata's date format, using months as the base unit:

gen start=ym(startYear,startMonth)
gen end=ym(endYear,endMonth)
format start end %tm

(If you're not familiar with Stata dates, you can learn about them here.)

Since the dates are now stored as "number of months since January 1960" (with a format that makes them readable by humans when printed in output) the duration of a semester can be calculated by subtracting the start date from the end date and adding 1:

gen duration=end-start+1

Next, use the expand command to create one copy of each semester for each month the semester lasted:

expand duration

We now have one observation per student per month, but there's no indication which month a particular observation represents. Since they're all the same (the student is assumed to have attended the same institution for the duration of the semester) we will simply assign them in order: the first observation for a given semester will represent the first month of the semester, the second observation the second month, etc.

Sometimes relative times (first month, second month, etc.) are adequate. In that case you can simply set an observation's month to its observation number within the semester:

bysort id semester: gen month=_n

But in this case it's probably much more useful to have a time variable that tells us in absolute terms which month the observation represents. The month the semester started is stored in the start variable, so all you have to do is add the observation number minus one:

bysort id semester: gen time=start+_n-1
format time %tm

time is now the only time variable you need, so you can drop all others plus the variables used to get this far:

drop start* end* dur semester

You now have a data set with one observation per student per month for the months the student was enrolled in school. You probably want observations for all of the months in the study period where the student was not enrolled in school as well.

The fillin command ensures there's an observation for each unique combination of the variables it's given, creating new observations as needed. Newly created observations will have missing values for all the other variables and will have a _fillin variable set to 1, but we won't need that variable for this task. Running:

fillin id time
drop _fillin

will ensure there's an observation for each student for each month in the data set. If at least one student is enrolled in every month during the study period, this will create a full set of observations.

At this point months in which the student was not enrolled have a missing value for level (the level of the institution attended, two-year or four-year). You could change that to 0 if you prefer:

replace level=0 if level==.

However, if a month is entirely missing from the data set (because no one is enrolled during that month) fillin will not create observations for that month. In that case, if you want observations for every month in the study period you'll need to add the missing months.

The following code calculates the length of the study period, creates one observation for every month in that period for a fake student, then combines that with the original data, runs fillin, and drops the fake student. The result is a data set with an observation for every month for all students.

// save what you've done thus far
save sem_panel2, replace

// keep only the time variable
keep time

// calculate the duration of the study period
// sort by time, then subtract first time from last time
sort time
gen dur=time[_N]-time[1]

// cut to one observation
keep if _n==1

// expand it to one observation per month in the study period
expand dur

// set time for each observation
replace time=time+_n-1

// create id for fake student
gen id=-9

// add back to full data
append using sem_panel2

// create observations using fillin
fillin id time
drop _fillin

// drop the fake student
drop if id==-9

Last Revised: 8/21/2014