Stata for Researchers: Project Management

This is part ten of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

In a typical project you have a research question you want to answer and some data that you think will answer it, but the data isn't in a form that can actually answer the question—yet. Project management is about getting from raw data to completed analysis.

Simple Best Practices

Books have been written about how to manage research projects properly. While we won't go into that level of detail here, we will suggest a few simple best practices that can save a tremendous amount of time and reduce the probability of making serious mistakes.

Master your Data Set

Everything you do depends on your data set, so spend the time to get to know it well. How was it collected? What does an observation represent? What variables are available? Is there any hierarchy or other structure? Which variables are continuous, categorical, binary or text? How are they distributed? Is there missing data? How much? Is there structure to the missing data ("People who answered X to question Y then skipped to question Z")?

Real mastery (i.e. knowing the answers to most if not all of these questions without having to look them up) will come as you're working on your project, but you should spend a significant amount of time learning about the data before you start writing code. Sometimes the answers to these questions can affect the feasibility of your research agenda. For example, crosstabs may reveal that you have too few observations in certain cells to get meaningful results. ("Not enough women with PhDs in science and engineering" can be a statistical problem as well as a social problem!)

Begin with the End in Mind

Before you write any code, decide what form the data needs to be in so you can analyze it. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use. Thinking it through ahead of time will prevent you from spending time putting the data in a form that seems natural but doesn't actually meet your needs.

Don't Try to do Everything at Once

Once the goal is clear in your mind, don't try to write one massive do file that gets you there in one step, running it for the first time only when it's "done." If you do, the do file will most likely have a large number of bugs. Then you may find that in order to make one part work, you need to do something in a different way than you originally planned. You'll then have to change everything that follows.

It's far better to write a bit of code, test and debug it, then write a little more, test and debug it, and so forth. But if a do file gets too big, you waste time rerunning old code you know is good in order to test what you just added. When that happens it's time to start a new do file.

Split Your Code into Multiple Do Files

If a do file gets too long, as you go through the write-test-debug cycle you'll find yourself spending too much time waiting for code you know is good to run so you can move on to the code you just added and need to test. More generally, you want to write do files that are short enough that while you're working on one you can remember everything it does.

To break up a long do file into smaller pieces, just pick a logical stopping point, have the do file save the data set at that point, then create a new do file that uses that data set as its starting point. Just remember: never save your output data set over your input data set.
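For example, the hand-off between two do files might look something like this (the file names here are just placeholders for this sketch; use names that fit your project):

    * End of firstSteps.do: save the partially prepared data under a new name
    save afterFirstSteps, replace

    * Start of secondSteps.do: pick up where firstSteps.do left off
    use afterFirstSteps, clear

Because secondSteps.do will save its own results under yet another name, the input file afterFirstSteps.dta is never overwritten.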

Put Code for Different Purposes in Different Do Files

While data preparation is a linear process with each step depending on what came before (first recode, then clean...), exploratory analysis often branches out (maybe ordinary linear regression is okay, but I'll try a mixed model too...). Then when you've identified the results you want to report or publish, you want the code that produces them to be as clean, clear, and concise as possible. Thus it's best to have separate do files for each of these purposes.

For most projects there should be a "final" data set that's used for all analysis. That way you can open it up interactively and try things, write do files that analyze it in different ways, and generally experiment at will without running the risk of forgetting that, for example, the do file that ran the linear regressions also did a bit more recoding.

Checking your Work

Programming errors can be subtle and very difficult to catch by just staring at your code. Generally it's more effective to spend your time comparing your results to what they should be. Of course this depends on having some sense of what they should be: be constantly on the lookout for information you can use to check your work.

Examine summary statistics and frequencies frequently as you carry out data preparation, especially when you create new variables or change the structure of your data. See if what you get is plausible. If the results change, be sure you can explain why.
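For example, after recoding a variable you might check it with something along these lines (the variable names here are illustrative, not from any particular data set):

    summarize income, detail
    tabulate educationCategory, missing
    tabulate educationCategory female, missing

The missing option tells tabulate to include missing values in the table, which makes it easier to spot observations that fell through your recoding logic.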

Spend even more time looking at individual cases. Use the browse command, often with a varlist and an if condition to allow you to focus on what's currently relevant, and compare what your do file did to individual cases with what you meant it to do. If you have different types of cases, be sure to look at samples of each.
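For instance, with the auto data used earlier in this series, you might look at just the foreign cars and just the variables you're currently working with:

    browse make price mpg rep78 if foreign

The variable list and if condition are only examples; choose whatever isolates the cases that matter at the moment.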

If you do find problems, looking at cases is the best way to solve them. What kinds of cases get the wrong answers? Which variables are wrong? Figuring out those details will point you to the particular commands that need to be corrected.

Make your Project Reproducible

With proper organization you should be able to reproduce your entire project at will.

Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a different file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until data preparation is complete. Then all your analysis do files will read the same final data set and analyze it in various ways.

If you discover errors or need to make changes, having a well-organized and reproducible project will save you significant amounts of time. To track down an error, run your do files one-by-one, checking the results after each, until the error appears. Then you'll know which do file needs to be fixed. Once the error is corrected or the change is made, consider whether it will affect subsequent do files. Once all the needed changes are made, simply rerun all your do files.

Consider writing a master do file that runs all the do files required by the project, in the proper order (recall that one do file can run another simply by running the command do otherDoFile). Also write a "readme" document to keep with the project files, containing other relevant information. This will be very valuable to anyone else who has to work with your code, but also to the future you who has to try to remember how it all worked months or years later.
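A master do file can be as simple as a list of do commands in the right order (the file names here are placeholders for whatever your project actually uses):

    * master.do: reproduce the entire project from the original data
    do readRawData
    do prepareData
    do analysis

Running master.do then rebuilds everything from the raw data through the final results.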

Case Studies

Two stories that illustrate the importance of proper project management:

One day a professor and her research assistant came to the SSCC's statistical consultants. They were working with census data from multiple countries over many years, so a lot of data preparation work was required to make the various data sets compatible and then combine them. The RA had been working on this data preparation for about six months.

Then the professor decided to run some basic frequencies on the data they had. The results were clearly wrong. The RA must have made a mistake at some point, and they came to us hoping we'd be able to fix the problem. After some discussion, we found that the RA had been doing all his work interactively. He had only a general recollection of what he had done, and had no do files, logs or intermediate data sets to fall back on. Since everything he had created was useless, the project had to be started again from the original data.

The next time we saw her, the professor had a new RA, one who was very careful to do everything using do files.

On a happier note, a grad student once came to the SSCC's statistical consultants because in preparing to present her research she discovered that the values of one variable for three observations had somehow been corrupted. Three observations probably wouldn't change her results, but we didn't really know.

Fortunately she had done everything using do files. We got the data from the source again, checked that it was intact this time, and then she re-ran all her do files. Months of work were replicated in less than 15 minutes, and she was able to proceed with her presentation.

Far more could be said about project management (we haven't even mentioned collaborating with others). You might find J. Scott Long's Workflow of Data Analysis Using Stata helpful.

Exercises

  1. We've looked at how foreign cars and domestic cars differ, but the foreign cars include both European cars and Japanese cars and you might reasonably expect them to differ from each other as well. Classify each car as American, European or Japanese. (Hint: one way to do that would be to identify the manufacturer of each car and then classify the manufacturers.) Then analyze how weight, price, mpg and rep78 differ between these categories using whatever statistical techniques you are comfortable with and think are appropriate (or use the ones discussed in Basic Statistics). Feel free to ignore the small sample size.

    Use good research practices in carrying out this exercise: put all the related files in a single directory, write at least two do files (one for data work and one for analysis), make them self-sufficient, add comments where needed, etc. For extra credit, once they're complete run them again using Condor.

  2. Consider the data sets finalscores and finaldemo. The first contains fictional scores on standardized tests. The second contains information about the fictional students and their families. Examine how household income, whether the student lives with one parent or both parents, and the maximum educational attainment of the student's parents (i.e. the educational attainment of the parent with the most education) predict test scores. Again, use good research practices in carrying out this exercise.

    It will be easiest to first create the explanatory variables described using the demographics file, then merge it with the scores. The merge will work best if you drop the non-students first (which you can do once you've figured out what you need to know about the students' families).

  3. Take your current research project and think through it as described above.

Next: Learning More

Previous: Graphics

Last Revised: 1/7/2015