This is part nine of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.
You now know how to construct useful commands from the components of Stata syntax. Now it's time to learn how to organize those commands into do files, and eventually into research projects.
Do files are simply text files whose names end with .do and which contain Stata commands exactly the way you'd type them into the command window. Do files are also programs, though Stata uses this term for something slightly different (see Stata Programming Tools). The Do File Editor in Stata for Windows is a very nice tool for writing do files, but since they are plain text you can use any text editor you prefer, including TextPad, emacs, vi, or even Notepad (do not use Word).
Let's write a miniature do file just to get the idea: Make sure you still have the automobile data set loaded (sysuse auto if you don't). Open the Do File Editor by clicking the button at the top that looks like a pencil writing in a notebook, or by typing doedit. Then type some commands, perhaps something along the lines of:
tab rep78 foreign, chi2
reg mpg weight foreign displacement
When you're done, click the far right button in the toolbar at the top (it will say Execute (do) if you hold the mouse over it). That will run the do file in your main Stata window--switch to it to see the results. If you get an error message, go back to the Do File Editor, make changes, and try again. This is one of the big advantages of do files: if you make a mistake you only have to change the part that's wrong, not start over.
One major weakness of this do file is that it only works if the automobile data set is already loaded. Real do files should load the data they work with and generally set up anything they need to run. That means the do file should include a use command, but where will it get the data from?
As mentioned earlier, we suggest you make a directory for each project you're working on and put all the data files, do files, log files and any other files relating to the project in that directory. When you run Stata, start by going to the directory of the project you want to work on using the cd command. That way do files can refer to data and log files simply by name, without having to specify a location, because all the files related to the project will be in the current working directory. If you later have to move the project (say, from Windows to Linux) you won't have to change its do files at all, just the cd command you use to get to it.
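For example, if all of a project's files live in a directory called carStudy (a hypothetical name and path), a session might begin:

```stata
cd c:\projects\carStudy
```

After that, the project's do files can refer to files by name alone, such as use carData rather than use c:\projects\carStudy\carData.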
Every do file should contain certain commands, mostly related to getting set up and saving the results.
If you are using Windows Stata or an interactive Stata session in Linux you should put the following commands at the beginning of every do file:
capture log close
set more off
The first command closes any open log file; the capture prefix keeps Stata from stopping with an error if no log is open. The second tells Stata not to pause whenever the screen fills and wait for you to press a key (while saying --more-- at the bottom). These commands aren't necessary if the do file will always be run in batch mode (more on batch mode shortly) but won't hurt.
Every do file should have a corresponding log file which records all the commands the do file ran and their results. If you run your do file in batch mode, the only way to get your results is by reading the log. To start logging, the command is:
log using filename.log, replace
where filename is the name of the file you want Stata to use as a log. We suggest giving the log file the same name as the do file it records, so it's obvious which log file goes with which do file. The replace option tells Stata that if a log file with that name already exists, usually from a previous attempt to run the program, it should be replaced by the current log.
If you do not specify the .log at the end of the filename, Stata will save the log in its own Stata Markup and Control Language (SMCL). SMCL has its uses, but it can only be read by Stata's Viewer. If your filename ends with .log, Stata will save the log as plain text which you can read in any text editor.
Load the data you need with the use command:
use dataset
where dataset should be replaced by the name of your data file.
Now you're ready to actually do your work: this is where you'll put the Stata commands you want to run, exactly as you'd type them in interactive Stata.
If this do file is for data preparation, you'll need to save your work at the end:
save newDataset, replace
The replace option again allows Stata to overwrite the output from previous attempts to run the do file. Never save the data over the data set you loaded.
The last line of the do file will normally be:
log close
If you don't close the do file's log, commands you run after the do file completes will be logged as if they were part of the do file.
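Putting these pieces together, a typical data-preparation do file has the following overall shape (the file names here are hypothetical):

```stata
capture log close
set more off
log using prepareCars.log, replace

use rawCars

// data preparation commands go here

save cleanCars, replace
log close
```

Note that the output data set (cleanCars) has a different name than the input data set (rawCars), so the do file never saves over its own input.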
Comments are text included in a do file for the benefit of human readers, not for Stata. Comments should explain what the program is doing and why, and if anyone else ever needs to read and understand your program they'll be very grateful for good comments. But the person who is most likely to benefit from your comments is you, when you try to figure out how your do file works months or years after writing it.
You don't need to comment every command--most Stata code is fairly easy to read. But be sure to comment any code that required some particular cleverness on your part.
Comments need to be marked as such so that Stata will not try to execute them. /* means Stata should ignore everything until it sees */, while // means Stata should ignore the rest of that line. Thus:
gen greatCar=(mpg>25 & rep78>3) if rep78<. | mpg<=25
/* We know a car is not "great" if mpg<=25 even if rep78 is missing */
tab greatCar foreign, chi2 //run chi-squared test
A useful programmer's trick is to "comment out" code you don't want to run right now, but don't want to delete entirely. For example, if you were running:
reg mpg weight foreign i.rep78
and wanted to see what happened if you removed foreign, you could change it to:
reg mpg weight /*foreign*/ i.rep78
If you don't want to run that regression at all, make it:
//reg mpg weight foreign i.rep78
As you've seen, one easy way to run a do file is to open it in the Do File Editor and click the Execute (do) button. You can also type
do dofile
in the command window, where dofile should be replaced by the name of the do file you want to run (assuming it's in the current working directory).
Windows Stata does not have a batch mode like Linux Stata does. However, you can prevent Stata from wasting CPU time updating the Results window by minimizing Stata once your do file starts. This will make a big difference if your do file prints a lot on the screen.
If your job requires more memory than Windows can provide, if it will run for more than a few hours, or if you want to run many jobs at a time, you should switch to Linux Stata. While the idea of using Linux scares many people, it's actually quite easy to run Stata jobs on the SSCC Linux servers.
In Linux Stata you can also submit a job in batch mode. Batch mode Stata doesn't waste CPU time drawing windows or putting results on the screen. It simply starts up, runs your do file, and quits when it is done without any further intervention. You get the results by reading the log file.
To run a do file in batch mode, type:
stata -b do dofile
at the Linux command prompt, where dofile should be replaced by the name of the do file you want to run.
If your do file will take more than a few minutes to run, consider submitting it to Condor by typing:
condor_stata -b do dofile
Consistently following a few best practices while doing your research can save a tremendous amount of time and reduce the probability of making serious mistakes.
In a typical situation you have a research question you want to answer and some data that you think will answer it, but the data isn't in a form that can actually answer the question--yet.
Everything you do depends on your data, so you need to know your data extremely well. How was the data collected? What does an observation represent? What variables are available? Is there any hierarchy or other structure? Which variables are continuous, categorical, binary or text? How are they distributed? Is there missing data? How much? Is there structure to the missing data ("People who answered X to question Y then skipped to question Z")?
Real mastery (i.e. knowing the answers to most if not all of these questions without having to look them up) will come as you're working on your project, but you should spend a significant amount of time learning about the data before proceeding. Sometimes the answers to these questions can affect the feasibility of your research agenda. For example, crosstabs may reveal that you have too few observations in certain cells to get meaningful results. ("Not enough women with PhDs in science and engineering" can be a statistical problem as well as a social problem!)
Before you write any code, figure out what form the data needs to be in so you can analyze it. What should an observation represent? What variables will each observation need to contain? The answers to these questions will most likely be determined by the statistical techniques you plan to use. Thinking it through ahead of time will prevent you from spending time putting the data in a form that seems natural, but doesn't actually meet your needs.
Once the goal is clear in your mind, don't try to write one massive do file that gets you there in one step, only trying to run it once it's "done." If you do, the do file will most likely have a large number of bugs. Then you may find that in order to make one part work, you need to do something in a different way than you originally planned. You'll then have to change everything that follows.
It's far better to write a bit of code, test and debug it, then write a little more, test and debug it, and so forth. But if a do file gets too big, you waste time rerunning old code you know is good in order to test what you just added. When that happens it's time to start a new do file.
Do files for data preparation will start with an input data file, modify it, and save the result. However you should never have a do file save its output over its input. If you do, you can never run that do file again because the input it was written to process is now gone. If it turns out that the do file contained an error, you may be reduced to asking the Help Desk to restore the input data from backup.
Programming errors can be subtle and very difficult to catch by staring at your code. The only way to realistically know if something worked is to look at the results. Checking summary statistics and/or frequencies can be a good way to detect errors if you have a sense of what they should be. They also have the advantage of looking at the entire data set, though in a limited way.
The best way to check for errors is to look at individual cases. Use the browse command, probably with a varlist to pick out the variables of interest, and compare what your do file did with what you meant to do. If you have different types of cases, be sure to look at samples of each. If you examine enough cases and don't find any errors you can be reasonably confident that your do file is working properly.
If you do find problems, looking at cases is the best way to solve them. What kinds of cases get the wrong answers? Which variables are wrong? Figuring out the details will point you to the particular commands that need to be corrected.
While data preparation tends to be a linear process, with each step depending on what came before (first recode, then clean...), analysis often branches out (maybe ordinary linear regression is okay, but I'll try a mixed model too...). Thus it's best to keep the two separate.
For most projects there should be a "final" data set that's used for all analysis. That way you can open it up interactively and try things, write do files that analyze it in different ways, and generally experiment at will without running the risk of forgetting that (for example) the do file that ran the linear regressions also did a bit more recoding.
Of course this does not apply to frequencies, summary statistics, etc. that you use in the course of preparing the data, to learn about the data set and check your work. Those can go wherever they're needed.
With proper organization you should be able to reproduce your entire project at will.
Start with the data as you obtained it. Your first do file will read it in, make some changes, and save the results in a separate file. Your second do file will read in the output from the first do file, make further changes, and then save its results in another separate file. Repeat until your project is done. If your data files are large, you can delete all the intermediate data sets except the original data and the input for the do file you're currently working on. If you follow this procedure you can recreate everything you've done at will just by rerunning all your do files. (It's also a good idea to make a "readme" file for each project with information like what order its do files must be run in.)
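As a hypothetical sketch, such a readme file might record the chain of do files and the data sets that link them:

```
Run the do files in this order:
1. importRaw.do  - reads rawSurvey.csv, saves survey1.dta
2. recode.do     - reads survey1.dta, saves survey2.dta
3. makeFinal.do  - reads survey2.dta, saves finalData.dta
4. analysis.do   - reads finalData.dta (results are in analysis.log)
```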
This method is also extremely helpful for debugging. If you discover a problem, you can go back through your do files one by one until you find the error, fix it, and then rerun the corrected code for your entire project with just a few commands. Making changes is equally easy.
Two stories that illustrate the importance of proper project management:
One day a professor and her research assistant came to the SSCC Help Desk. They were working with census data from many countries over many years, so a lot of data preparation work was required to make the various data sets comparable and then combine them. The RA had been working on this data preparation for about six months.
Then they decided to run some basic frequencies on the data they had. The results were clearly wrong. The RA must have made a mistake at some point, and they came to us hoping we'd be able to fix the problem. After some discussion, we found that the RA had been doing all his work interactively. He had only a general recollection of what he had done, and had no do files, logs or intermediate data sets to fall back on. Since everything he had created was useless, the project had to be started again from the original data.
The next time I saw her, the professor had a new RA, one who was very careful to do everything using do files.
On a happier note, a grad student once came to the SSCC Help Desk because in preparing to present her research she discovered that the values of one variable for three observations had somehow been corrupted. Three observations probably wouldn't change her results, but we didn't really know.
Fortunately she had done everything using do files. We got the data from the source again, checked that it was intact this time, and then she re-ran all her do files. Months of work were replicated in less than 15 minutes, and she was able to proceed with her presentation.
Far more could be said about project management (we haven't even mentioned collaborating with others). You might find J. Scott Long's Workflow of Data Analysis Using Stata helpful.
We've looked at how foreign cars and domestic cars differ, but the foreign cars include both European cars and Japanese cars and you might reasonably expect them to differ from each other as well. Classify each car as American, European or Japanese. (Hint: one way to do that would be to identify the manufacturer of each car and then classify the manufacturers.) Then analyze how weight, price, mpg and rep78 differ between these categories using whatever statistical techniques you are comfortable with and think are appropriate (or use the ones discussed in Basic Statistics). Feel free to ignore the small sample size.
Use good research practices in carrying out this exercise: put all the related files in a single directory, write at least two do files (one for data work and one for analysis), make them self-sufficient, add comments where needed, etc. For extra credit, once they're complete run them again using Condor.
Consider the data sets finalscores and finaldemo. The first contains fictional scores on standardized tests. The second contains information about the fictional students and their families. Examine how household income, whether the student lives with one parent or both parents, and the maximum education attainment of the student's parents (i.e. the educational attainment of the parent with the most education) predict test scores. Again, use good research practices in carrying out this exercise.
It will be easiest to first create the explanatory variables described using the demographics file, then merge it with the scores. The merge will work best if you drop the non-students first (which you can do once you've figured out what you need to know about the students and their families).
Next: Learning More
Last Revised: 12/15/2011