This is part nine of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.
You now know how to construct useful commands from the components of Stata syntax. Now it's time to learn how to organize those commands into do files, and eventually into research projects.
Do files are simply text files whose names end with .do and which contain Stata commands exactly the way you'd type them into the command window. Do files are also programs, though Stata uses this term for something slightly different (see Stata Programming Tools). The Do File Editor in Stata for Windows is a very nice tool for writing do files, but since they are plain text you can use any text editor you prefer, including TextPad, emacs, vi, or even Notepad (do not use Word).
Let's write a miniature do file just to get the idea: Make sure you still have the automobile data set loaded (sysuse auto if you don't). Open the Do File Editor by clicking the button at the top that looks like a pencil writing in a notebook, or by typing doedit. Then type some commands, perhaps something along the lines of:
tab rep78 foreign, chi2
reg mpg weight foreign displacement
When you're done, click the far right button in the toolbar at the top (it will say Execute (do) if you hold the mouse over it). That will run the do file in your main Stata window--switch to it to see the results. If you get an error message, go back to the Do File Editor, make changes, and try again. This is one of the big advantages of do files: if you make a mistake you only have to change the part that's wrong, not start over.
One major weakness of this do file is that it only works if the automobile data set is already loaded. Real do files should load the data they work with and generally set up anything they need to run. That means the do file should include a use command, but where will it get the data from?
As mentioned earlier, we suggest you make a directory for each project you're working on and put all the data files, do files, log files and any other files relating to the project in that directory. When you run Stata, start by going to the directory of the project you want to work on using the cd command. That way do files can refer to data and log files simply by name, without having to specify a location, because all the files related to the project will be in the current working directory. If you later have to move the project (say, from Windows to Linux) you won't have to change its do files at all, just the cd command you use to get to it.
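For example, if all of a project's files live in a directory called carStudy (a hypothetical name and path), a session might begin:

```stata
cd c:\projects\carStudy
```

After that, the project's do files can refer to files by name alone, such as use carData rather than use c:\projects\carStudy\carData.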
Every do file should contain certain commands, mostly related to getting set up and saving the results.
If you are using Windows Stata or an interactive Stata session in Linux you should put the following commands at the beginning of every do file:
capture log close
set more off
The first command closes any open log file; the capture prefix keeps Stata from stopping with an error if no log is open. The second tells Stata not to pause whenever the screen fills and wait for you to press a key (while saying --more-- at the bottom). These commands aren't necessary if the do file will always be run in batch mode (more on batch mode shortly) but won't hurt.
Every do file should have a corresponding log file which records all the commands the do file ran and their results. If you run your do file in batch mode, the only way to get your results is by reading the log. To start logging, the command is:
log using filename.log, replace
where filename is the name of the file you want Stata to use as a log. We suggest giving the log file the same name as the do file it records, so it's obvious which log file goes with which do file. The replace option tells Stata that if a log file with that name already exists, usually from a previous attempt to run the program, it should be replaced by the current log.
If you do not specify the .log at the end of the filename, Stata will save the log in its own Stata Markup and Control Language (SMCL). SMCL has its uses, but it can only be read by Stata's Viewer. If your filename ends with .log, Stata will save the log as plain text which you can read in any text editor.
Load the data you need with the use command:
use dataset
where dataset should be replaced by the name of your data file.
Now you're ready to actually do your work: this is where you'll put the Stata commands you want to run, exactly as you'd type them in interactive Stata.
If this do file is for data preparation, you'll need to save your work at the end:
save newDataset, replace
The replace option again allows Stata to overwrite the output from previous attempts to run the do file. Never save the data over the data set you loaded.
The last line of the do file will normally be:
log close
If you don't close the do file's log, commands you run after the do file completes will be logged as if they were part of the do file.
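Putting these pieces together, a typical data-preparation do file has the following overall shape (the file names here are hypothetical):

```stata
capture log close
set more off
log using prepareCars.log, replace

use rawCars

// data preparation commands go here

save cleanCars, replace
log close
```

Note that the output data set (cleanCars) has a different name than the input data set (rawCars), so the do file never saves over its own input.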
Comments are text included in a do file for the benefit of human readers, not for Stata. Comments should explain what the program is doing and why, and if anyone else ever needs to read and understand your program they'll be very grateful for good comments. But the person who is most likely to benefit from your comments is you, when you try to figure out how your do file works months or years after writing it.
You don't need to comment every command--most Stata code is fairly easy to read. But be sure to comment any code that required some particular cleverness on your part.
Comments need to be marked as such so that Stata will not try to execute them. /* means Stata should ignore everything until it sees */, while // means Stata should ignore the rest of that line. Thus:
gen greatCar=(mpg>25 & rep78>3) if rep78<. | mpg<=25
/* We know a car is not "great" if mpg<=25 even if rep78 is missing */
tab greatCar foreign, chi2 //run chi-squared test
A useful programmer's trick is to "comment out" code you don't want to run right now, but don't want to delete entirely. For example, if you were running:
reg mpg weight foreign i.rep78
and wanted to see what happened if you removed foreign, you could change it to:
reg mpg weight /*foreign*/ i.rep78
If you don't want to run that regression at all, make it:
//reg mpg weight foreign i.rep78
As you've seen, one easy way to run a do file is to open it in the Do File Editor and click the Execute (do) button. You can also type
do dofile
in the command window, where dofile should be replaced by the name of the do file you want to run (assuming it's in the current working directory).
Windows Stata does not have a batch mode like Linux Stata does. However, you can prevent Stata from wasting CPU time updating the Results window by minimizing Stata once your do file starts. This will make a big difference if your do file prints a lot on the screen.
If your job requires more memory than Windows can provide, if it will run for more than a few hours, or if you want to run many jobs at a time, you should switch to Linux Stata. While the idea of using Linux scares many people, it's actually quite easy to run Stata jobs on the SSCC Linux servers.
In Linux Stata you can also submit a job in batch mode. Batch mode Stata doesn't waste CPU time drawing windows or putting results on the screen. It simply starts up, runs your do file, and quits when it is done without any further intervention. You get the results by reading the log file.
To run a do file in batch mode, type:
stata -b do dofile
at the Linux command prompt, where dofile should be replaced by the name of the do file you want to run.
If your do file will take more than a few minutes to run, consider submitting it to Condor by typing:
condor_stata -b do dofile
Consistently following a few best practices while doing your research can save a tremendous amount of time and reduce the probability of making serious mistakes.
In a typical situation you have a research question you want to answer and some data that you think will answer it, but the data isn't in a form that can actually answer the question--yet.
Everything you do depends on your data, so you need to know your data extremely well. How was the data collected? What does an observation represent? What variables are available? Is there any hierarchy or other structure? Which variables are continuous, categorical, binary or text? How are they distributed? Is there missing data? How much? Is there structure to the missing data ("People who answered X to question Y then skipped to question Z")?
Real mastery (i.e. knowing the answers to most if not all of these questions without having to look them up) will come as you're working on your project, but you should spend a significant amount of time learning about the data before proceeding. Sometimes the answers to these questions can affect the feasibility of your research agenda. For example, crosstabs may reveal that you have too few observations in certain cells to get meaningful results. ("Not enough women with PhDs in science and engineering" can be a statistical problem as well as a social problem!)
Before you write any code, figure out what form the data needs to be in so you can analyze it. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use. Thinking it through ahead of time will prevent you from spending time putting the data in a form that seems natural, but doesn't actually meet your needs.
Once the goal is clear in your mind, don't try to write one massive do file that gets you there in one step, only trying to run it once it's "done." If you do, the do file will most likely have a large number of bugs. Then you may find that in order to make one part work, you need to do something in a different way than you originally planned. You'll then have to change everything that follows.
It's far better to write a bit of code, test and debug it, then write a little more, test and debug it, and so forth. But if a do file gets too big, you waste time rerunning old code you know is good in order to test what you just added. When that happens it's time to start a new do file.
Do files for data preparation will start with an input data file, modify it, and save the result. However you should never have a do file save its output over its input. If you do, you can never run that do file again because the input it was written to process is now gone. If it turns out that the do file contained an error, you may be reduced to asking the Help Desk to restore the input data from backup.
Programming errors can be subtle and very difficult to catch by staring at your code. The only way to realistically know if something worked is to look at the results. Checking summary statistics and/or frequencies can be a good way to detect errors if you have a sense of what they should be. They also have the advantage of looking at the entire data set, though in a limited way.
The best way to check for errors is to look at individual cases. Use the browse command, probably with a varlist to pick out the variables of interest, and compare what your do file did with what you meant to do. If you have different types of cases, be sure to look at samples of each. If you examine enough cases and don't find any errors you can be reasonably confident that your do file is working properly.
If you do find problems, looking at cases is the best way to solve them. What kinds of cases get the wrong answers? Which variables are wrong? Figuring out the details will point you to the particular commands that need to be corrected.
While data preparation tends to be a linear process, with each step depending on what came before (first recode, then clean...), analysis often branches out (maybe ordinary linear regression is okay, but I'll try a mixed model too...). Thus it's best to keep the two separate.
For most projects there should be a "final" data set that's used for all analysis. That way you can open it up interactively and try things, write do files that analyze it in different ways, and generally experiment at will without running the risk of forgetting that (for example) the do file that ran the linear regressions also did a bit more recoding.
Of course this does not apply to frequencies, summary statistics, etc. that you use in the course of preparing the data, to learn about the data set and check your work. Those can go wherever they're needed.
With proper organization you should be able to reproduce your entire project at will.
Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a separate file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until your project is done. If your data files are large, you can delete all the intermediate data sets except the original data and the input for the do file you're currently working on. If you follow this procedure you can recreate everything you've done at will just by rerunning all your do files. (It's also a good idea to make a "readme" file for each project with information like what order its do files must be run in.)
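As a hypothetical sketch, such a readme file might record the chain of do files and the data sets that link them:

```
Run the do files in this order:
1. importRaw.do  - reads rawSurvey.csv, saves survey1.dta
2. recode.do     - reads survey1.dta, saves survey2.dta
3. makeFinal.do  - reads survey2.dta, saves finalData.dta
4. analysis.do   - reads finalData.dta (results are in analysis.log)
```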
This method is also extremely helpful for debugging. If you discover a problem, you can go back through your do files one by one until you find the error, fix it, and then rerun the corrected code for your entire project with just a few commands. Making changes is equally easy.
Two stories that illustrate the importance of proper project management:
One day a professor and her research assistant came to the SSCC Help Desk. They were working with census data from many countries over many years, so a lot of data preparation work was required to make the various data sets comparable and then combine them. The RA had been working on this data preparation for about six months.
Then they decided to run some basic frequencies on the data they had. The results were clearly wrong. The RA must have made a mistake at some point, and they came to us hoping we'd be able to fix the problem. After some discussion, we found that the RA had been doing all his work interactively. He had only a general recollection of what he had done, and had no do files, logs or intermediate data sets to fall back on. Since everything he had created was useless, the project had to be started again from the original data.
The next time I saw her, the professor had a new RA, one who was very careful to do everything using do files.
On a happier note, a grad student once came to the SSCC Help Desk because in preparing to present her research she discovered that the values of one variable for three observations had somehow been corrupted. Three observations probably wouldn't change her results, but we didn't really know.
Fortunately she had done everything using do files. We got the data from the source again, checked that it was intact this time, and then she re-ran all her do files. Months of work were replicated in less than 15 minutes, and she was able to proceed with her presentation.
Far more could be said about project management (we haven't even mentioned collaborating with others). You might find J. Scott Long's Workflow of Data Analysis Using Stata helpful.
We've looked at how foreign cars and domestic cars differ, but the foreign cars include both European cars and Japanese cars and you might reasonably expect them to differ from each other as well. Classify each car as American, European or Japanese. (Hint: one way to do that would be to identify the manufacturer of each car and then classify the manufacturers.) Then analyze how weight, price, mpg and rep78 differ between these categories using whatever statistical techniques you are comfortable with and think are appropriate (or use the ones discussed in Basic Statistics). Feel free to ignore the small sample size.
Use good research practices in carrying out this exercise: put all the related files in a single directory, write at least two do files (one for data work and one for analysis), make them self-sufficient, add comments where needed, etc. For extra credit, once they're complete run them again using Condor.
Consider the data sets finalscores and finaldemo. The first contains fictional scores on standardized tests. The second contains information about the fictional students and their families. Examine how household income, whether the student lives with one parent or both parents, and the maximum education attainment of the student's parents (i.e. the educational attainment of the parent with the most education) predict test scores. Again, use good research practices in carrying out this exercise.
It will be easiest to first create the explanatory variables described using the demographics file, then merge it with the scores. The merge will work best if you drop the non-students first (which you can do once you've figured out what you need to know about the students and their families).
Next: Learning More
Last Revised: 12/15/2011