Stata for Researchers: Do Files and Project Management

You now know how to construct useful commands from the components of Stata syntax. The next step is to learn how to organize those commands into do files, and eventually into research projects.

Do Files

Do files are simply text files whose names end with .do and which contain Stata commands exactly the way you'd type them into the command window. Do files are also programs, though Stata uses this term for something slightly different (see Programming in Stata). The Do File Editor in Stata for Windows is a very nice tool for writing do files, but since they are plain text you can use any text editor you prefer, including TextPad, emacs, vi, or even Notepad. Do not use Word, which does not save plain text by default.

Let's write a miniature do file just to get the idea. Make sure you still have the automobile data set loaded (sysuse auto if you don't). Open the Do File Editor by clicking the button at the top that looks like a pencil writing in a notebook, or by typing doedit. Then type some commands, perhaps something along the lines of:

tab rep78 foreign, chi2
reg mpg weight foreign displacement

When you're done, click the far right button in the toolbar at the top (it will say Execute (do) if you hold the mouse over it). That will run the do file in your main Stata window--switch to it to see the results. If you get an error message, go back to the Do File Editor, make changes, and try again. This is one of the big advantages of do files: if you make a mistake you only have to change the part that's wrong, not start over.

Using Files Within Do Files

One major weakness of this do file is that it only works if the automobile data set is already loaded. Real do files should load the data they work with and generally set up anything they need to run. That means the do file should include a use command, but where will it get the data from?

As mentioned earlier, we suggest you make a directory for each project you're working on and put all the data files, do files, log files and any other files relating to the project in that directory. When you run Stata, start by going to the directory of the project you want to work on using the cd command. That way do files can refer to data and log files simply by name, without having to specify a location, because all the files related to the project will be in the current working directory. If you later have to move the project (say, from Windows to Linux) you won't have to change its do files at all, just the cd command you use to get to it.
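
As a minimal sketch of how this works, the commands below save a copy of the automobile data in the current working directory and then load it again by name alone. The path in the first line and the file name autoCopy are hypothetical, chosen only for illustration; substitute your own project directory.

```stata
// cd "Z:/myProject"       (hypothetical path: uncomment and edit to point at your project)
pwd                         // display the current working directory
sysuse auto, clear          // load the example data shipped with Stata
save autoCopy, replace      // no path given, so the file lands in the working directory
use autoCopy, clear         // and can be loaded again by name alone
```

Because neither save nor use mentions a location, the same lines keep working if the project directory is moved.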

Anatomy of a Do File

Every do file should contain certain commands, mostly related to getting set up and saving the results.

Set up in Interactive Stata

If you are using Windows Stata or an interactive Stata session in Linux you should put the following commands at the beginning of every do file:

clear all
capture log close
set more off

The first command clears the memory so you don't have to worry about what might have happened before your program was run. The second closes any open log files. The third tells Stata not to pause whenever the screen fills and wait for you to press a key (while saying --more-- at the bottom). These commands aren't necessary if the do file will always be run in batch mode (more on batch mode shortly) but won't hurt.

Declare a Version

Next, tell Stata what version of Stata the do file was written for:

version 11

This tells future versions of Stata to interpret all subsequent commands the way they were interpreted in Stata 11, so you won't have to modify your do files when a new version comes out. (In reality it's very rare that you'd have to, but better safe than sorry.)
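
If you'd like to see the effect, the following sketch compares the version of Stata that is actually running with the version your commands are being interpreted as. It assumes you are running Stata 11 or later.

```stata
version 11
display c(stata_version)    // the version of Stata actually running
display c(version)          // the version commands are interpreted as: 11
```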

Start a Log

Every do file should have a corresponding log file which records all the commands the do file ran and their results. If you run your do file in batch mode, the only way to get your results is by reading the log. To start logging, the command is:

log using filename.log, replace

where filename is the name of the file you want Stata to use as a log. We suggest giving the log file the same name as the do file it records, so it's obvious which log file goes with which do file. The replace option tells Stata that if a log file with that name already exists, usually from a previous attempt to run the program, it should be replaced by the current log.

If you do not specify the .log at the end of the filename, Stata will save the log using its Stata Markup and Control Language. SMCL has its uses, but it can only be read by Stata's Viewer. If your filename ends with .log, Stata will save the log as plain text which you can read in any text editor.
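
If you already have a log in SMCL format, one way to get a plain text copy is Stata's translate command. The sketch below creates a small SMCL log and converts it; logDemo is a hypothetical file name used only for this example.

```stata
capture log close
log using logDemo, replace                    // no extension, so Stata writes logDemo.smcl
display "This line is recorded in the log"
log close
translate logDemo.smcl logDemo.log, replace   // convert the SMCL log to plain text
```

In a real do file it's simpler to put .log in the file name from the start, as described above.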

Load your Data

Load the data you need with the use command:

use dataset

Do your Work

Now you're ready to actually do your work: this is where you'll put the Stata commands you want to run, exactly as you'd type them in interactive Stata.

Save your Data

If this do file is for data preparation, you'll need to save your work at the end:

save newDataset, replace

The replace option again allows Stata to overwrite the output from previous attempts to run the do file. Never save the data over the data set you loaded.

Close your log

The last line of the do file will normally be:

log close

If you don't close the do file's log, commands you run after the do file completes will be logged as if they were part of the do file.
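
Putting the pieces together, a complete do file might look something like the sketch below. It is only an illustration: it assumes the do file is called prepAuto.do, uses sysuse auto in place of use dataset so that it runs anywhere, and the variable gp100m and the output file autoPrepared are made-up names.

```stata
clear all
capture log close
set more off
version 11
log using prepAuto.log, replace

sysuse auto                     // stands in for: use dataset

// Do your work: fuel consumption in gallons per 100 miles
gen gp100m = 100/mpg
label variable gp100m "Gallons per 100 miles"

save autoPrepared, replace      // a new file, never the data set you loaded
log close
```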

Comments

Comments are text included in a do file for the benefit of human readers, not for Stata. Comments should explain what the program is doing and why, and if anyone else ever needs to read and understand your program they'll be very grateful for good comments. But the person who is most likely to benefit from your comments is you, when you try to figure out how your do file worked months or years after writing it.

You don't need to comment every command--most Stata code is fairly easy to read. But be sure to comment any code that required some particular cleverness on your part.

Comments need to be marked as such so that Stata will not try to execute them. /* means Stata should ignore everything until it sees */, while // means Stata should ignore the rest of that line. Thus:

gen greatCar=(mpg>25 & rep78>3) if rep78<. | mpg<=25
/* We know a car is not "great" if mpg<=25 even if rep78 is missing */
tab greatCar foreign, chi2 //run chi-squared test

A useful programmer's trick is to "comment out" code you don't want to run right now, but don't want to delete entirely. For example, if you were running:

reg mpg weight foreign i.rep78

and wanted to see what happened if you removed foreign, you could change it to:

reg mpg weight /*foreign*/ i.rep78

If you don't want to run that regression at all, make it:

//reg mpg weight foreign i.rep78

Exercise

  1. Take the do file you wrote initially and turn it into a full-fledged do file by adding the elements described above. Test and debug it until it runs successfully.

Running Do Files in Windows

As you've seen, one easy way to run a do file is to open it in the Do File Editor and click the Execute (do) button. You can also type

do dofile

in the command window, where dofile should be replaced by the name of the do file you want to run (assuming it's in the current working directory).

Windows Stata does not have a batch mode like Linux Stata does. However, you can prevent Stata from wasting CPU time updating the Results window by minimizing Stata once your do file starts. This will make a big difference if your do file prints a lot on the screen.

Running Do Files in Linux

If your job requires more memory than Windows can provide, if it will run for more than a few hours, or if you want to run many jobs at a time, you should switch to Linux Stata. While the idea of using Linux scares many people, it's actually quite easy to run Stata jobs on the SSCC Linux servers.

  1. Move your project's directory to the Z: drive, which is your Linux home directory
  2. Log into any SSCC Linux server using X-Win32 (or an alternative program if you're not using a PC)
  3. Type xstata
  4. Change to your project directory using cd (cd project)
  5. Go to work just as if you were using Windows Stata

In Linux Stata you can also submit a job in batch mode. Batch mode Stata doesn't waste CPU time drawing windows or putting results on the screen. It simply starts up, runs your do file, and quits when it is done without any further intervention. You get the results by reading the log file.

  1. Log into Linux as described above
  2. Type cd project at the Linux command prompt (cd is a Linux command as well as a Stata command)
  3. Type

    stata -b do dofile

    at the command prompt, where dofile should be replaced by the name of the do file you want to run

If your do file will take more than a few minutes to run, consider submitting it to Condor by logging into Kite and typing:

condor_stata -b do dofile

The SSCC has a tremendous amount of computing power available through our Condor flock. See An Introduction to Condor for more information. We'd also suggest reading Running Jobs in Linux.

Organizing Your Research Project

Consistently following a few best practices while doing your research can save a tremendous amount of time and reduce the probability of making serious mistakes.

In a typical situation you have a research question you want to answer and some data that you think will answer it, but the data isn't in a form that can actually answer the question--yet.

Begin with the End in Mind

The first thing you should do is figure out what form the data needs to be in so you can analyze it. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use. Thinking it through ahead of time will prevent you from spending time putting the data in a form that seems natural, but doesn't actually meet your needs.

Don't Try to do Everything at Once

Once the goal is clear in your mind, don't try to write one massive do file that gets you there in one step, only trying to run it once it's "done." If you do, the do file will most likely have a large number of bugs. Then you may find that in order to make one part work, you need to do something in a different way than you originally planned. You'll then have to change everything that follows.

It's far better to write a bit of code, test and debug it, then write a little more, test and debug it, and so forth. But if a do file gets too big, you waste time rerunning old code you know is good in order to test what you just added. When that happens it's time to start a new do file.

Never Write your Output over your Input

Do files for data preparation will start with an input data file, modify it, and save the result. However, you should never have a do file save its output over its input. If you do, you can never run that do file again because the input it was written to process is now gone. If it turns out that the do file contained an error, you may be reduced to asking the Help Desk to restore the input data from backup.
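
Here is a minimal sketch of the pattern. The file names autoRaw and autoClean and the variable weightKg are hypothetical; the first two lines just create a stand-in for data "as you obtained it" so the example can run on its own.

```stata
sysuse auto, clear
save autoRaw, replace           // stand-in for the data as you obtained it

use autoRaw, clear              // input
gen weightKg = weight*0.4536    // weight is in pounds in the auto data
save autoClean, replace         // output goes to a different file
```

Since autoRaw is never overwritten, you can fix a mistake in the middle section and rerun the do file as many times as you need.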

Separate Data Preparation from Analysis

While data preparation tends to be a linear process, with each step depending on what came before (first recode, then clean...), analysis often branches out (maybe ordinary linear regression is okay, but I'll try a mixed model too...). Thus it's best to keep the two separate.

For most projects there should be a "final" data set that's used for all analysis. That way you can open it up interactively and try things, write do files that analyze it in different ways, and generally experiment at will without running the risk of forgetting that (for example) the do file that ran the linear regressions also did a bit more recoding.

Of course this does not apply to frequencies, summary statistics, etc. that you use in the course of preparing the data, to learn about the data set and check your work. Those can go wherever they're needed.
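
For example, a data preparation do file might include quick checks like the following sketch, which uses the automobile data. The particular checks are just illustrations; the assert command stops the do file with an error if its condition is false for any observation.

```stata
sysuse auto, clear
tab rep78, missing              // frequencies, including missing values
summarize mpg price weight      // ranges and means at a glance
assert mpg > 0 & mpg < .        // stop with an error if any mpg is nonpositive or missing
```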

Make your Project Reproducible

With proper organization you should be able to reproduce your entire project at will.

Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a separate file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until your project is done. If your data files are large, you can delete the intermediate data sets, keeping only the original data and the input for the do file you're currently working on. If you follow this procedure you can recreate everything you've done at will just by rerunning all your do files. (It's also a good idea to make a "readme" file for each project with information like what order its do files must be run in.)

This method is also extremely helpful for debugging. If you discover a problem, you can go back through your do files one by one until you find the error, fix it, and then rerun the corrected code for your entire project with just a few commands. It's also easy to make changes.
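
One way to rerun everything in order is a short "master" do file. The sketch below is an assumption about how you might set that up, not a required convention: the names prepStep1, prepStep2 and analysis are hypothetical, and it uses a foreach loop to check that each do file exists before running it.

```stata
// master.do: rerun the whole project in order (all file names are hypothetical)
local nRun 0
foreach step in prepStep1 prepStep2 analysis {
    capture confirm file "`step'.do"
    if _rc {
        display as error "`step'.do not found in the working directory; stopping"
        continue, break
    }
    do `step'
    local ++nRun
}
display "`nRun' do file(s) run"
```

The order of the names in the loop doubles as a record of the order in which the do files must be run.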

Case Studies

Two stories that illustrate the importance of proper project management:

One day a professor and her research assistant came to the SSCC Help Desk. They were working with census data from many countries over many years, so a lot of data preparation work was required to make the various data sets comparable and then combine them. The RA had been working on this data preparation for about six months.

Then they decided to run some basic frequencies on the data they had. The results were clearly wrong. The RA must have made a mistake at some point, and they came to us hoping we'd be able to fix the problem. After some discussion, we found that the RA had been doing all his work interactively. He had only a general recollection of what he had done, and had no do files, logs or intermediate data sets to fall back on. Since everything he had created was useless, the project had to be started again from the original data.

The next time I saw her, the professor had a new RA, one who was very careful to do everything using do files.

On a happier note, a grad student once came to the SSCC Help Desk because in preparing to present her research she discovered that the values of one variable for three observations had somehow been corrupted. Three observations probably wouldn't change her results, but we didn't really know.

Fortunately she had done everything using do files. We got the data from the source again, checked that it was intact this time, and then she re-ran all her do files. Months of work were replicated in less than 15 minutes, and she was able to proceed with her presentation.

Far more could be said about project management (we haven't even mentioned collaborating with others). You might find J. Scott Long's Workflow of Data Analysis Using Stata helpful.

Exercise

  1. We've looked at how foreign cars and domestic cars differ, but the foreign cars include both European cars and Japanese cars and you might reasonably expect them to differ from each other as well. Classify each car as American, European or Japanese. (Hint: one way to do that would be to identify the manufacturer of each car and then classify the manufacturers. Feel free to take advantage of the foreign variable as well.) Then analyze how weight, price, mpg and rep78 differ between these categories using whatever statistical techniques you are comfortable with and think are appropriate (or use the ones discussed in Basic Statistics). Feel free to ignore the small sample size.

    Use good research practices in carrying out this exercise: put all the related files in a single directory, write at least two do files (one for data work and one for analysis), make them self-sufficient, add comments where needed, etc. For extra credit, once they're complete run them again using Condor.

  2. Take your current research project and think through it as described above.

Last Revised: 10/8/2009