Stata is the most popular program for statistical analysis at
the SSCC, as it is both extremely powerful and relatively easy
to learn. Its straightforward but flexible syntax makes it a
good choice for data manipulation and management, and it implements
a very large number of statistical models and techniques. Stata
also has an extensive user community which has made
a great deal of code available for free, including many
additional estimators. We've been quite pleased with Stata at
the SSCC, and we think you'll find it extremely useful.
There are two different approaches one can take to Stata. One
is to use it as an interactive tool: you start Stata, load your
data, and start typing or clicking on commands. This is an excellent
way to learn Stata; thus it's how you'll spend most of your
time as you work through this publication. It is also a good
way to explore your data, figure out what you want to do, and
check that your programs worked properly. However, interactive
work cannot be easily or reliably replicated, or modified if
you change your mind. It's also very difficult to recover from
mistakes--there's no "undo"
command in Stata.
The other approach is to treat Stata as a programming language.
In this approach you write your programs, called do files, and
then run them. A do file contains exactly the same Stata commands
you'd type in interactive Stata, but since they're all written
up in a permanent file they can easily be rerun, modified, checked
for errors, or debugged. They also serve as an exact record
of how you obtained your results--a sort of lab notebook for
the social scientist. I feel very strongly that any work you
intend to publish or present should be done using do files. Thus
this publication will for the most part ignore Stata's graphical
user interface and focus on preparing you to write do files for
research.
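As a rough illustration of what such a file might contain, here is a minimal do file sketch. The file names example.do and example.log are hypothetical, and the data set is the sample auto data used throughout this publication. Treat it as a preview rather than something you need to understand yet.

```stata
* example.do: a minimal do file sketch (file names are hypothetical)
* close any log left open by an earlier run, ignoring the error if there is none
capture log close
* record everything that follows in a plain text log
log using example.log, replace text
* load the sample data from Stata's web site, discarding anything in memory
use http://www.stata-press.com/data/r10/auto.dta, clear
* the analysis itself goes here
summarize mpg
log close
```

Rerunning the file repeats every step from loading the data to producing results, which is what makes the work replicable.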
The goal of this publication is to give you a solid
foundation in Stata that you can then build on to become an expert
Stata user. If your goal is to learn just enough Stata to get
you through a particular course you might want to consider reading
something like Alan Acock's book "A Gentle Introduction to
Stata" instead.
This publication contains the following sections:
- Running Stata
- Getting Started
- Stata Commands
- Working with Data
- Commands to Examine Data
- Renaming and Labeling Variables
- Creating and Modifying Variables
- Analysis
- Graphs
- Do Files
- Organizing Your Research Project
- Learning More
Running Stata
The SSCC makes Stata available on Winstat and our Linux servers.
For details about the capabilities of the SSCC's servers see
Computing Resources at the SSCC. You can
find out how busy the various Linux servers are by visiting our server
status web page (Winstat always directs you to the least busy
server). Windows Stata and Linux Stata look and act
the same, and you can even write your programs in Windows and
run them in Linux--see Running
Linux Programs Using Windows (Mostly) for details. Linux Stata
is significantly faster, however, partly because of the nature
of Linux but mostly because the SSCC's Linux servers run Stata/MP,
which uses multiple processors. You can also submit Stata jobs
to the SSCC's Condor flock either
from Linux or from
the web.
To start Stata on Winstat, launch it from the Windows menus.
To start Stata on a Linux server, type xstata.
This requires X-Windows graphics to run. If you're connecting
to Linux from a PC or from Winstat you will need to use
X-Win32 to display Linux graphics: see Connecting
to SSCC Linux Computers using X-Win32.
The Stata Interface
You'll see something
like this:
The window on the bottom right with no label is where you'll enter
commands. When you press Enter, they are pasted into the window
above. This is where you will see your commands execute and
view the results.
On the left are two convenience windows. One
keeps a list of your current variables. If you click on one of
them, its name will be pasted into the current command at the
location of the cursor, which saves a little typing. The other
window keeps a list of all the commands you've typed this Stata
session. Click on one, and it will be pasted into the command
window, which is handy for fixing typos. Double-click, and the
command will be pasted and re-executed. You can also recall previous
commands by pressing Page Up. You
can export everything in this second
window into a do
file by right-clicking on it, but this includes any mistakes
you made.
Getting Started
Start up Stata on the server you've chosen. You should be seeing
the graphical user interface just like the picture above.
Memory
Stata loads your entire data set into memory, but by default it
sets aside just ten megabytes to store it. This is enough
for many data sets (including the trivial one we'll deal with
in our examples) but for real work you'll often need to set it
much higher. This is done by typing:
set mem size
The default unit for memory sizes is kilobytes, but you'll probably
want to use megabytes. Just add "m" to the number. For
example to claim 100 megabytes type:
set mem 100m
If you don't know how big your data set is you can find out using
Stata's ls command, which we'll discuss
shortly. Set Stata's memory to about 25% to 50% larger than the
data set you'll be using, depending on what you plan to do
with it (in particular, whether you'll be adding new variables).
If you try to set the memory too high,
you'll get an error message.
If you get this and really need that much memory, the first thing
you should do is switch to Linux because it can provide more
memory. If the standard Linux servers fail, go to Falcon. It
runs 64-bit Linux and can provide even more memory. If Falcon
can't provide enough memory you need to rethink your strategy.
Are there variables or observations in your data set that you aren't
using? Could you split the data set into sections and process
each section separately? This may be a good time to visit the consultant for
advice.
Note that when you resize the memory any data currently in memory
would be lost. Get in the habit of setting memory first, before
loading anything.
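Put together, the habit described above might look like the following sketch. The file name mydata is hypothetical and the 100m figure is only an example. Note also that this command belongs to older versions of Stata like the one described here; newer versions manage memory automatically.

```stata
* start from an empty session so no data is at risk
clear
* claim 100 megabytes before loading anything
set mem 100m
* only now load the data (mydata is a hypothetical file name)
use mydata
```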
Finding and Loading Data
Now you're ready to load some data. Stata can access the entire
file system of the computer it is loaded on, and uses Linux-style
directory navigation to move around. However, this means that,
like Linux, Stata has a hard time with file and directory names
that have spaces in them. If the file or directory you need has
a space in it, you must put the entire path in quotes. On the
other hand, Stata doesn't care if you use forward slashes (/)
or backslashes (\) to separate directories.
Stata Corp. thoughtfully includes some sample data with the Stata
program and we'll use it extensively. Let's start by using the cd (change
directory) command to navigate
to the directory where Stata is installed. On Linux that's /software/stata.
Type:
cd /software/stata
On the Winstats, you need c:\program files\stata10,
and program files has a space in
it. So type:
cd "c:\program files\stata10"
If you are not using one of the SSCC's servers Stata may be installed
in the same location, or you may need to look around (c:\stata
is another popular choice). You can also get the example data directly
from Stata's web site, as you'll see in a moment.
Next see what's here using the ls
(list) command. Just type:
ls
The file you want is called auto.dta
(.dta is the standard extension
for Stata data sets). Its size is listed as 5.8k, meaning
5.8 kilobytes, so the default memory size of 10 megabytes is very
much more than adequate. To load it type:
use auto
Note that you didn't have to type the .dta;
Stata assumed it. There's just one trick to the use
command: if you already have data in memory, and if you've made
any changes at all since you loaded it, Stata will refuse to replace
it with another data set unless you specifically tell it to do
so. You can do this in two ways. One is to type clear before
typing use, thus removing all the
current data from memory. The other is to add the clear
option to the use command
(more on options in a moment). To do that, type:
use auto, clear
With the clear option,
the new data will replace the old with no complaints.
Yes, you can also load data using the menus. But when you start writing
do files you'll need to use the
use command,
so you might as well start now.
Stata can open a data set from the web as easily as from your
local hard drive. For example, you can get this exact same data
set by typing:
use http://www.stata-press.com/data/r10/auto.dta, clear
Stata Commands
Now that you've given a Stata command let's talk about how commands
work. The general form of a Stata command is this:
[by varlist:] command [varlist] [=expression] [if expression] [in range] [statistical weights] [, options]
Brackets mean that element may or may not be there in a given
command. Some commands require some elements or cannot use others.
We'll go through most of these elements using the list
command as an example. Type:
list
The list command, unsurprisingly,
lists your data. You'll get quite a bit to look at even with
the small example data set--don't try this with census data!
There are ways to list just what you want, but for now quit the
current command by pressing q or
clicking on the red, stop-sign shaped button with the white
X on it near the top of the screen.
list can
be abbreviated as just l. As
you'll see, many Stata commands have abbreviations.
Varlists
If you give a command a varlist the
command will be executed just for the variables in the varlist.
Type:
l make
make is one of the variables in
this data set. When you type l make it
lists just the make of each car.
As the name suggests a varlist can
include multiple variables. Try typing:
l make price mpg
If
An if condition specifies which observations the command should
act on:
l make mpg if mpg==25
This gives you a list of just those cars which got exactly 25
miles per gallon. Note that you had to type two equals signs.
Stata, like most computer languages, understands two different
meanings for "equals." One equals sign means assignment: mpg=25 means
"make mpg 25." Two equals signs is for testing: mpg==25
asks
"is mpg equal to 25 or not?" This will drive you crazy
for about a week and then it will become second nature.
Also note the order: l
if mpg==25 make won't
work.
The exclamation point is used for "not." != means
"not equals" but you can also use it by itself. For example,
try:
l make mpg if mpg!=25
l make mpg if mpg>25
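l make mpg if !(mpg>25)
The exclamation point can also be thought of as reversing the
condition that follows it: changing false to true and true to false.
The parentheses matter: Stata applies ! before comparisons like >, so
without them the ! would apply to mpg alone.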
if conditions can be very complicated
and often making a program work will come down to crafting the
exact condition that will identify the observations you need.
Logical and is
denoted by & and
logical or is denoted by | (the pipe
character, which you get by pressing Shift-\). Use parentheses
liberally to avoid getting confused about the precedence of logical
operations. Try:
l make if (price<4000) | (price<5000 & mpg>30)
This gives you a list of cars someone might buy if they wanted
to pay less than $4000 but were willing to go up to $5000 if
the car got more than 30 miles per gallon. (Note that these prices are
in 1978 dollars!)
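When a condition mixes ! with parentheses, it can help to write it a second way and confirm that the two lists match. The sketch below assumes the auto data is loaded. Both commands should list the same cars, since neither price nor mpg has missing values in this data set.

```stata
* cars that are neither cheap nor efficient: ! applied to a parenthesized condition
l make price mpg if !((price<4000) | (mpg>30))
* the same condition written without !, reversing each comparison
l make price mpg if (price>=4000) & (mpg<=30)
```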
In
in allows you to specify the
observations the command should act on by observation number.
For example, to see the make of the
first three observations, type:
l make in 1/3
1/3 is Stata's shortcut for the
integers one, two and three, so what you see are observations
one, two and three. If you give negative numbers, Stata will
count from the end of your data set. So to see the makes of the
last three observations, type:
l make in -3/-1
Note the order: the numbers still go from smallest to largest.
This is because in this data set -3 really means observation
72 (third from last) and -1 is really 74 (the last). in
is handy if you just want to see a few random observations to check
results, but it's especially useful if the order of the observations
means something (for example, if the head of a household is always
the first observation in the household).
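Stata also accepts f and l (lowercase L) in a range as shorthand for the first and last observations. A short sketch, assuming the auto data is loaded:

```stata
* the first three observations
l make price in f/3
* the last three observations
l make price in -3/l
* a single observation by number
l make price in 10
```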
An Aside on Value Labels
To learn by: we'll focus on the foreign variable,
but there's something important you need to notice about it
first:
foreign has value labels assigned
to it. If you just type:
l foreign
it appears that foreign is a
string of characters just like
make. This is deceptive. To see what's
really going on add the nolabel option. Options affect
how commands are executed. Some are unique to a certain command,
but others apply to many commands. As described by the generic
command syntax, options always come at the end of the command,
following a comma:
l foreign, nolabel
The variable foreign is actually
stored as an indicator variable (0 or 1) but a value label has
been defined so that 0 is displayed as "Domestic" and
1 is displayed as "Foreign".
Value labels are very convenient, but it's important to notice
them. For example,
l make if foreign=="Domestic"
won't work. The syntax is correct (including putting character
strings like "Domestic" in
quotes) but you'll get a type mismatch because foreign is actually
a number and you can't compare a number to a character string. The
correct command is:
l make if foreign==0
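If you are unsure whether a variable is hiding numbers behind labels, a few commands will tell you. This sketch assumes the auto data is loaded and that the value label attached to foreign is named origin, which is what describe reports for the copy of auto.dta that ships with Stata. Check the output of the first command if yours differs.

```stata
* the "value label" column shows which label, if any, is attached
describe foreign
* show the mapping from numbers to text (origin is assumed to be the label's name)
label list origin
* tabulate the underlying numbers rather than the labels
tab foreign, nolabel
```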
By:
By: is used to run a command separately for different
groups. For example, list the domestic cars and the foreign cars
separately by typing:
by foreign: l make foreign
Note how the list is broken into two parts, each with a header at the
top identifying its group: first the domestic cars, then the foreign ones.
By: splits the data set up into separate
groups, one group for each unique value of the by:
variable, then executes the command for each group.
Since by: takes a varlist,
you can use more than one variable at a time. Try both foreign and rep78, a measure of the car's repair record on a five-point
scale:
by foreign rep78: l make
You'll get an error message saying the data are not sorted.
Stata can only use by: if the data
set is sorted by the varlist. This
data set started out sorted by foreign,
but not by rep78. Annoying, but
hardly fatal. Type:
sort foreign rep78
and Stata will sort the data and allow you to execute by
foreign rep78: l make successfully. As you can see it breaks
the data set into one group for each unique combination
of foreign
and rep78 and then carries out the
command.
Users got rather tired of that error message, so Stata provided
a shortcut:
bysort foreign rep78: l make
This will first sort the data by foreign
and rep78, then carry out the rest
of the command.
Working with Data
Now we'll move from Stata commands in and of themselves to how
Stata thinks about your data--and how you can take advantage
of that in order to make your programs work.
Missing Values
You may not have expected two of the groups in the last list: the ones where rep78 is shown as a period (.). That's because rep78
has a possible value in addition to the numbers one through five:
it can be missing, which Stata denotes by a period (.).
A Stata data set is just a big matrix. Each observation is a row
and each column is a variable. The matrix is always rectangular
and can't have "holes" in it, so if an observation doesn't have
a value for a variable Stata stores missing (.).
Internally, missing is stored as +infinity, which cuts down the
special code needed to deal with it but often causes confusion.
When using by: missing is just another
value and gets its own by: group as
you saw. If you are doing math, any expression that includes
a missing value gives a missing value for a result. So if you
defined a new variable (we'll learn how in a bit) to be rep78+mpg
the new variable would be missing for any observation in
which
rep78 or mpg is
missing.
To see how missing values can confuse people, type in:
l make rep78 if rep78>3
What you might expect is that rep78 would
be all 4's and 5's, because rep78 can only be an integer between
1 and 5 inclusive. But missing is coded as +infinity, which is
definitely greater than 3. Thus observations with rep78 missing
are included in the list. This can cause major headaches if
not handled properly. For example, imagine trying to identify
people over 65 with the condition if age>65.
You'd end up declaring everyone whose age is missing to be "over
65." There have been many proposals to change the way
Stata handles missing values in if conditions, but
none of them have really been any better. The bottom line is
that you have to think about missing values and how to handle
them.
The easiest way is to make sure there are
no missing values. Stata has a handy command called assert
which can do just that. You give it a condition and it will check
whether the condition is true for all observations or not. Type:
assert mpg!=.
This asserts that mpg is never missing. When you type it, nothing
happens. That means the assertion is true. From now on you
never have to worry about the possibility that mpg
is missing. Now try:
assert rep78!=.
This time Stata complains at you. If you had put this in a do
file your do file would have crashed--which is good. It's better
for a do file to crash than for it to do something
that doesn't make sense because you erroneously thought there
were no missing values in your data.
What you do with the missing values you have depends on what
you're trying to do with your data. Let's assume that in this
case you wanted a list of cars that were known to have rep78 greater
than 3. Thus missing values should be excluded. Then type:
l make rep78 if rep78>3 & rep78!=.
Now you'll get the list of 4's and 5's you wanted before.
This has a weakness though: Stata also allows you to track different
kinds of missing as .a, .b,
up through .z. For example, a survey
might have "did not apply", "refused to answer" and "the
reason why this one is missing is itself missing" and you could
code those three situations as .a, .b and
.c. But since .a is not the same
as the generic .,
the condition rep78!=. will not exclude
the .a's, .b's,
etc. Veteran Stata programmers would thus write:
l make rep78 if rep78>3 & rep78<.
.a is not the same as .,
but every kind of missing is treated as larger than any actual number,
and the generic . is the smallest of them. Only actual (non-missing)
values will meet the condition rep78<..
This way you don't have to worry about the possibility that
a .a might wreck your code.
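A quick way to see how much difference the extra condition makes is to count the observations each version selects. This is a sketch, assuming the auto data is loaded and rep78 has not been renamed. count and the missing() function are standard Stata, though neither is covered in this publication.

```stata
* counts cars with rep78 of 4 or 5 plus those where rep78 is missing
count if rep78>3
* counts only cars with an actual value of 4 or 5
count if rep78>3 & rep78<.
* counts the cars where rep78 is any kind of missing
count if missing(rep78)
```

The first count should exceed the second by exactly the third.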
Explicit Indexes
Stata has several powerful tools for accessing specific observations.
First of all, most Stata commands
are actually loops. The list command,
for example, lists the first observation, then the second observation,
then the third, etc. As Stata is performing this loop, it keeps
track of which observation it's working on in a variable called
_n. You're also welcome to use it
in your code. For example, type:
l if _n==5
This lists the fifth observation.
Which is the least fuel efficient car? Type the following:
sort mpg
l make mpg if _n==1
No particular surprise there. (There were no Hummers in 1978.)
How about the most efficient? Now you want the last observation.
Stata has another internal variable called _N.
It contains either, depending on how you want to think about
it, the number of observations in the data set or the observation
number of the last observation--it's the same number either way.
(Note to veteran programmers: Stata's observation numbers start
with one, not zero as you may be used to for arrays.) Thus:
l make mpg if _n==_N
gives you the most efficient car (which is, again, no particular
surprise--unless you were expecting today's hybrids to get better
mileage than the top car in 1978). You could also
sort in descending order using gsort (think
generalized sort). gsort
works just like sort, except if
you put a minus sign in front of a variable it will be sorted
in descending order. Thus you could have typed:
gsort -mpg
l make mpg if _n==1
and gotten the same result.
Now, so far you could have done all
this with in (try it!). But explicit
indexes can do much more. For one thing, _n and _N take
into account by: groups, while by: and in can't
be combined. Suppose you wanted to know the most fuel efficient
domestic car and the most fuel efficient foreign car. All you
have to do is type:
sort foreign mpg
by foreign: l make mpg if _n==_N
In order to use by foreign: the
data had to be sorted by foreign,
and to get the result you wanted each type of car had to be
sorted by mpg.
When you started your command with by
foreign:,
Stata split the data set in two and each group had its own
value of _N. _n also
starts over from one when the command goes from one by: group
to the next. Your sort guaranteed that in both groups the
most fuel efficient car was last, or in other words that it had _n==_N.
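The same idea works when creating variables. This sketch, which assumes the auto data is loaded, uses gen (covered later in this publication) to number the cars within each group and record each group's size. The names mpgRank and groupSize are made up for the example.

```stata
sort foreign mpg
* position of each car within its group, from least to most efficient
by foreign: gen mpgRank = _n
* number of cars in each group
by foreign: gen groupSize = _N
* show the least and most efficient car in each group
l make foreign mpg mpgRank groupSize if mpgRank==1 | mpgRank==groupSize
* remove the illustration variables so the data set is unchanged
drop mpgRank groupSize
```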
A challenge for you: in this data set we hope
that the make
variable uniquely identifies a car, and most data sets have such
a putatively unique identifier. The tools we've just learned allow
us to make sure this is true with just one command--see if you
can think of it.
The answer combines assert, by:
and _N (think some more and
see if you can get it now). The solution may be tricky to understand
but it's easy to use:
bysort make: assert _N==1
bysort
make: splits the data set into a separate group for each
value of make. The assert
then checks that the size of each of these groups is 1. If there
were two cars with the same value of make they'd
be in the same by: group and that group would have a size
of two (and thus _N==2). If every by: group has just one
observation in it (_N==1) then make is
a unique identifier.
(You could also check this by typing duplicates
report make and
making sure there are no surplus observations. The duplicates command
can do a lot of other things too, but the goal here is to learn
how to do useful things with by: and _N.)
That's not all you can do with explicit indexes. Type:
di make[1]
di is short
for display, and simply displays
something on the screen (it's also one of the few commands that
doesn't get executed once for each observation). make[1]
means the value of make for observation
1. Try displaying make[5], make[10],
as many as you like.
You may have noticed that the mpg data
is all integers. That means that while in theory mpg is
a continuous variable, in reality many cars have the same value
of mpg. Let's
make a list of all the cars that have the same value of mpg as
some other car:
sort mpg
l make mpg if mpg==mpg[_n-1] | mpg==mpg[_n+1]
How did this work? By sorting the data by mpg,
we put all the cars with the same value of mpg next
to each other. Thus to check if a given car shares a value of
mpg we only have to look at the car before it and the car after
it and see if their values of mpg are the same.
Since _n is
the observation Stata is currently working on, _n-1 means
the observation before the one Stata is currently working on
and _n+1 means the observation after.
However, note that for the first observation mpg[_n-1] is
missing, and for the last observation mpg[_n+1] is
missing. If we had a car with a missing value for mpg it
would have been sorted to the end (remember missing is +infinity)
and then shown up in the list even if no other car was missing mpg, because
for it mpg==mpg[_n+1] would
be true. We could add some code to handle this situation, but
remember we checked and know that mpg is
never missing so we don't have to worry about it.
While this example is fairly silly, you'll find plenty of uses
for this ability to access observations other than the one being
worked on.
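One such use is computing the change from one observation to the next. The sketch below assumes the auto data is loaded. The variable name mpgGap is made up, and gen is covered later in this publication.

```stata
sort mpg
* difference between each car's mpg and that of the car just before it;
* the first observation has no predecessor, so its gap is missing
gen mpgGap = mpg - mpg[_n-1]
l make mpg mpgGap in 1/5
* remove the illustration variable
drop mpgGap
```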
Commands to Examine Data
You now have a solid understanding of the Stata syntax. However,
to do anything useful you'll need to know more commands than
just list. We'll start with more
commands for examining data.
describe (d)
is a good place to start whenever you open a new or unfamiliar
data set. It will give you information like the number of observations
and variables, the size of the data set in memory, plus a list
of the variables it has and their types along with any labels
describing them. Especially watch out for value labels.
summarize (sum)
gives you summary statistics. If you just type:
sum
you will get basic summary statistics for all the variables in
your data set (no varlist
usually means the command should act on all the variables). Note
that there is nothing for make:
it is a string variable so means and such don't make sense. The detail
(d) option will give more information.
Try:
sum mpg, d
tabulate (tab)
will create tables of frequencies. It requires a varlist
of either one or two variables. Try:
tab rep78
tab foreign rep78
tab rep78 foreign
to get an idea of what tab does.
Tables are usually easier to read if the variable with the most
unique values comes first, so they're listed vertically. There
are limits though. Try:
tab weight foreign
Because weight is a quasi-continuous
variable it has too many unique values for this table to be
useful. (tab foreign weight would be even uglier.)
The tab command won't accept more
than two variables, but you can create three-way or higher tables
by combining tab with by:.
bysort foreign: tab mpg rep78
If you want to create one-way tables for multiple variables,
use tab1:
tab1 mpg foreign
Since tab1 only does one-way tables
it won't interpret this as a request for a two-way table like
tab would. In fact tab1 will
take any number of variables and create one-way tables for them
all.
tab has an option called sum which
is similar to the sum command. Try
tab foreign, sum(mpg)
This gives summary statistics of mpg for
the foreign and domestic cars in addition to their frequencies.
There's also a chi2 option that
runs a chi-squared test on a two-way table:
tab rep78 foreign, chi2
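By default tab leaves missing values out of its tables, which ties back to the earlier discussion of missing values. The missing option includes them, and the row option adds row percentages to a two-way table. A sketch, assuming the auto data is loaded and rep78 has not yet been renamed:

```stata
* one-way table with a separate row for missing values
tab rep78, missing
* two-way table including missing values, with row percentages
tab rep78 foreign, missing row
```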
Renaming and Labeling Variables
The next commands we'll learn actually make changes to the data
set. We'll continue working interactively as before because it's
the best way to learn, but I want to emphasize that for actual
research you should only change your data set using do files.
Remember, there's no "undo" in Stata.
It amazes and dismays me to see some of the gibberish that
researchers often use for variable names (H2V06 and the like).
Survey makers with 10,000 variables may not have a choice, but
once you're down to the dozen variables you'll actually
use you're not stuck using the names they gave you. Renaming
variables to something that is meaningful takes a bit of time,
but will save a great deal of time and confusion down the road.
Variable names can be up to 32 characters long.
The names in
the auto data set are generally good, but rep78 doesn't
mean much on its own so let's change it to something more clear. rename (ren)
will do the job:
ren rep78 repairRecord
Note that variable names must be one word with no spaces. However
capitalization can make them more readable. For comparison try
to interpret numinhh vs. numInHH
(number in household). Another alternative is to use the underscore
(_) as a space: num_in_hh. Personally
I get tired of reaching for the underscore key, but it's a matter
of taste.
The proper length for variable names is also a matter of taste. repairRecord
is very clear, but it's fairly long. rep78
is much shorter, but doesn't really tell you what it means if you
don't already know. In general the more often you use a variable
the shorter its name should be, both to save typing
and because it will be familiar anyway.
Labels
There is a partial substitute for long variable names: Stata allows
you to define labels for variables and values that appear in
output. You only have to type them once, so they can be as long
as you want (though if they're too long they'll make your output
ugly).
This data set already has a good set of labels. Type d
to see them. Then change the label of the data set itself
by typing:
label data "1978 Automobile Data that came with Stata"
The label on foreign is a bit misleading, so
change it too:
label variable foreign "Car Origin"
Type d again to see the results.
Next let's explore value labels by labeling the values of the
repairRecord variable (formerly known
as rep78). A value label
is a mapping from a set of numbers to a set of descriptions.
First you must create the map. Type the following:
label define repRec 1 "Very Bad" 2 "Bad" 3 "Average" 4 "Good" 5 "Very Good"
Then you need to tell Stata to label the values of the repairRecord variable
using the repRec mapping you just
created:
label values repairRecord repRec
To see the results, type:
tab repairRecord
Two final
commands for labels: label dir
gives you a list of all the defined labels, and label
list tells you what they mean.
Once a map is defined, there's no limit to the number of variables
you can apply it to. Suppose you're working with survey data
and your variables include the gender of the respondent, the
gender of the respondent's spouse, and the genders of all the
respondent's children. You could define just one map called gender and
then use it to label the values of all the gender variables.
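The auto data has no gender variables, so the following sketch illustrates the same idea with two made-up indicator variables, cheap and thrifty, that share a made-up map called yesNo. It assumes the auto data is loaded and uses gen, which is covered in the next section.

```stata
* define the map once
label define yesNo 0 "No" 1 "Yes"
* two hypothetical indicator variables to label
gen byte cheap = price<4000
gen byte thrifty = mpg>25
* apply the same map to both
label values cheap yesNo
label values thrifty yesNo
tab cheap thrifty
* clean up the illustration variables and the map
drop cheap thrifty
label drop yesNo
```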
Creating and Modifying Variables
generate (gen)
creates new variables. The general syntax is simply:
gen newVariable=some expression
As an example,
create a variable giving the car prices in 2008 dollars.
To convert 1978 dollars to 2008 dollars you need to multiply
the 1978 price by about 3.3:
gen price2008=price*3.3
Type:
sum price price2008, d
to see summary statistics of the results.
replace changes existing variables,
but the syntax is identical to gen.
Let's be a bit more precise: the conversion factor is actually
closer to 3.31, so change price2008 accordingly:
replace price2008=price*3.31
Note that there is no abbreviation for replace.
Commands that could destroy
data never have abbreviations.
Both gen and replace can
be used with if. When you make a new
variable with gen it is created for
every observation, but where the if condition is not true
for a particular observation that observation gets a missing
value for the new variable. With replace,
where the if condition
is not true the value of the variable is left unchanged.
Suppose you wanted to collapse the five-point scale of the repairRecord variable
(formerly rep78) into a three-point
scale. Here's one way to do it:
gen rep3=1 if repairRecord<3
replace rep3=2 if repairRecord==3
replace rep3=3 if repairRecord>3 & repairRecord<.
The first line creates the new variable, but only sets it to one
for cases where repairRecord is less
than three. The
others get missing. The second line changes some of those missings
to twos, and the third changes more
of them to threes. Note how the third line specifically excludes
observations where repairRecord is
missing. What will the value of rep3 end
up being for those cases? Missing, simply because it was never
set to anything else.
(There is a recode command which
can do this particular task more compactly, but the real goal
here is to learn how to use gen and replace.)
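For comparison, here is a sketch of what the recode version might look like. It assumes repairRecord and rep3 exist as created above; the name rep3b is made up so that the two results can be compared side by side.

```stata
* collapse 1-2, 3, and 4-5 into a three-point scale in a new variable;
* values not mentioned in the rules, including missing, are left as they are
recode repairRecord (1 2 = 1) (3 = 2) (4 5 = 3), generate(rep3b)
* the two versions should agree, including on missing values
tab rep3 rep3b, missing
drop rep3b
```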
One common recoding task is turning a categorical variable into
a set of indicator variables, but tab has
a gen option that can
do that for you. Type:
tab repairRecord, gen(repair)
Now type d to see what you've done.
Note that it even makes labels!
The egen command, short for extended
generate, gives you access to a large library of functions--type
help egen for a full list. With standard
generate you have to (or get to,
depending on your point of view) specify exactly what the new
variable should be equal to. With egen you simply choose the
function that meets your needs. If there isn't one, you're back
to generate.
Suppose you wanted to find the mean value of mpg,
calculated separately for the foreign cars and the domestic cars
for practice with by:. egen has
a mean function which will give you
exactly what you want:
by foreign: egen meanMPG=mean(mpg)
l make foreign mpg meanMPG
But what if for some odd reason you wanted to create halfMeanMPG equal
to the mean divided by two? What you can't
do is add that to the egen command:
by foreign: egen halfMeanMPG=mean(mpg)/2
The error message you'll get is confusing, but the real problem
is that all egen can do is set a
variable equal to the result of a single function, not an
expression built from one (like
mean/2). If you really wanted to
divide the mean by two you'd have to type:
by foreign: egen halfMeanMPG=mean(mpg)
replace halfMeanMPG=halfMeanMPG/2
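Another way to get the same result, sketched here on the assumption that meanMPG from the earlier command is still in the data set, is to let egen do its one job and then use the result in an ordinary gen expression (halfMeanMPG2 is just an illustrative name):

```stata
* meanMPG already holds the group means, so gen can use it in any expression
gen halfMeanMPG2=meanMPG/2
* Confirm it matches the two-step version above
assert halfMeanMPG2==halfMeanMPG
```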
One trick that can be very handy is setting a variable equal to
a condition. Stata has no boolean (true or false) variable type.
Instead it uses numbers. Zero is always false. If you're testing
to see if something is true, Stata will interpret anything but
zero as true. But if you're setting a variable equal to a condition,
Stata will set the variable to one if the condition is true and
zero if it is false.
For example, let's create an indicator variable called gasGuzzler
that is one for cars that get less than average gas mileage and
zero for others:
gen gasGuzzler=(mpg<meanMPG)
If the condition is true, gasGuzzler
will be one. If it is false, gasGuzzler
will be zero. We can then list all the gas guzzlers by typing:
l if gasGuzzler
This is equivalent to
l if gasGuzzler==1
but more natural. You can do the same thing with gender variables:
if you've got gender encoded as male=0 and female=1, consider calling
the variable female rather than sex
or gender. Then you can write commands
that end in conditions like if female
instead of writing if sex==1 and
having to remember what that means.
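One caution when setting a variable equal to a condition: Stata treats missing as larger than any number, so a condition like repairRecord>3 is true for cars whose repair record is missing. The sketch below (goodRepairNaive and goodRepair are invented names) shows the difference an if condition makes:

```stata
* Naive version: cars with a missing repairRecord are coded as one
gen goodRepairNaive=(repairRecord>3)
* Safer version: the indicator is left missing where repairRecord is missing
gen goodRepair=(repairRecord>3) if repairRecord<.
* Compare the two, including missing values
tab goodRepairNaive goodRepair, missing
```

The two versions differ only for the cars with a missing repair record.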
Let's do the ultimate modification of a variable:
drop meanMPG
This eliminates the variable meanMPG from our data set. We can
also eliminate observations, for example:
drop if gasGuzzler
gets rid of all gas guzzlers (just doing our bit to fight global
warming).
keep does the same thing, but in
the opposite fashion. keep meanMPG
would get rid of all variables but meanMPG,
while keep if gasGuzzler would get
rid of all the fuel-efficient cars.
Saving Data Sets
Finally, if you were doing anything important you'd need to save
your work. Just type save filename
to create a new Stata data set containing the data that is currently
in memory. If this file already exists Stata will refuse to overwrite
it unless you use the replace
option, so in do files this command
usually looks like:
save filename, replace
Note that if you do not specify an extension, Stata will add
.dta by default (which is what you
want).
Never save over the data set you loaded. If you
do, you can never rerun the do file (at least not in the same
way) because the original input is gone.
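In a do file that pattern might look like the sketch below. sysuse is used here only because it loads the copy of auto that ships with Stata from any directory; autoModified is a hypothetical output name and the cutoff of 20 mpg is arbitrary:

```stata
* Load the original data; this file is never written to
sysuse auto, clear
* Make a change worth saving
gen gasGuzzler=(mpg<20)
* Save under a different name so the input survives and the do file can be rerun
save autoModified, replace
```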
Analysis
Stata has many, many commands for doing all sorts of statistical
analysis. But they've also worked very hard to make them all as
similar as possible. So while we're just going to do a simple linear
regression, the syntax is almost identical for a broad range of
more complex models.
Since we've made rather a mess of this data set, reload the original
by typing:
use auto, clear
Now let's see how much consumers are willing to pay for good gas
mileage using a simple, naive, hedonic pricing model. Whether
a car is foreign or domestic seems to be important, so throw
that in too. Type:
regress price mpg foreign
This regresses price on mpg
and foreign. Note that regress
takes a varlist, just like any other
command, but it uses it in a particular way. The first variable
is the dependent variable, and it is regressed on all the others
in the list plus a constant (unless you add the noconstant
option). The results suggest that American
consumers dislike fuel efficiency, and will pay to avoid it!
Like any good researcher, when our empirical results contradict
our theory we look for better empirical results. We just might
have some omitted variable bias here; in particular it's probably
important to control for the size of the car. Looking
over the variable list we
see several variables related to
size. You could throw them all in, but they're probably highly
correlated and you don't want to introduce collinearity. Check
using the correlate
(corr) command. Type (note that
this is a great time to use the Variables
window to enter variable names by clicking on them):
corr weight length displacement trunk headroom
While all the variables are positively correlated, weight,
trunk, and headroom aren't
too bad so go ahead and add all three:
reg price mpg foreign weight trunk headroom
Now mpg is insignificant but weight
is highly significant. Looks like Americans like big cars and don't
care about fuel efficiency. That I'll believe.
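To see how far that syntax carries, here is a brief sketch using the original auto data. The first two commands reuse the price regression with an option and with an if condition; the third swaps in a different estimator (logit, chosen only for illustration) with the dependent variable still first in the varlist:

```stata
* Same model without a constant term
regress price mpg foreign, noconstant
* Same model, restricted to cars that cost less than 10,000
regress price mpg foreign if price<10000
* A different estimator with the same layout: dependent variable, then the rest
logit foreign mpg weight
```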
Graphs
Stata has a suite of tools for creating publication-quality graphs.
Graphs are inherently complicated objects and the syntax for
creating them can also get quite complicated. However, simple
graphs with the default settings are very easy to make. For example,
to make a scatterplot of mpg versus weight, simply type:
scatter mpg weight
If you want a line graph instead, type:
line mpg weight, sort
The sort option here does not mean
Stata should sort the data. Rather it means that the
line should be drawn from the observation with the smallest value
of weight to the observation with
the next smallest, etc. Without it the line would be drawn from
observation one to observation two to observation three and so
forth, and the result would look like a scribble (try it).
The easiest way to keep track of the many settings and details
involved in creating a graph is to use the point-and-click graphical
user interface. Stata will translate what you choose into a
Stata command which you can rerun, put into a do file, or modify.
Start by choosing the twoway graph item from the Graphics menu (twoway meaning a graph that
has an X and a Y). Then click the button
to create a new graph.
You'll then get a window where you can choose the basic properties
of your graph. Leave the category set to its default, set the
type to a line plot, and choose or type mpg as
the Y variable and weight as the
X variable. Check the box that sorts on the x variable. Then click the button that accepts the plot.
This will take you back to the main graphics Window. You could
click again to add another
graph which would be overlaid on the line graph you already defined.
But there are several other tabs that control the properties
of the graph.
Select the if/in tab and you can
choose which observations are to be included. Type price<10000 in
the if
box (note that you don't have
to type the word if).
Next click on the By tab. Check
the box that draws subgraphs for unique values of variables
and
for the variables choose
or type foreign.

Click Submit, and the graph will be
created. The command for creating it will also be placed in the
results window:
twoway (line mpg weight, sort) if price<10000, by(foreign)
Note how for
graphs by is
an option, not a prefix like you're used to. That's because you're
not creating two completely separate graphs for the domestic and
foreign cars like you would with the standard by:. Instead
you're creating one graph with the two subpopulations next to
each other.
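The overlaying mentioned above looks like this when written as a command. This is a sketch that assumes the original auto data is in memory; it adds a scatterplot under the line in each panel:

```stata
* Each set of parentheses is one plot; the if condition and by() apply to both
twoway (scatter mpg weight) (line mpg weight, sort) if price<10000, by(foreign)
```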
If you open the twoway graph dialog again, the same settings will still be there so you
can refine the options you chose and try again. Once you've
got the graph you want, copy the resulting command into a
do file. If you want to start a new graph instead, click
on the large R (reset) button in
the lower left of the window.
For much more information about creating graphs, see An
Introduction to Stata Graphics.
Do Files
You now know how to construct useful commands from the components
of Stata syntax. Next it's time to learn how to organize those
commands into do files.
Do files are simply text files whose names end with .do and
which contain Stata commands exactly the way you'd type them
into the command window. Since they are plain text you can use
any text editor you prefer, including TextPad, emacs, vi, or
even Notepad. Stata includes a simple text editor, very similar
to Notepad (but it's also available in Linux). If you need to
run do files on a Linux server but don't know any Linux text
editors, take a look at Running
Linux Programs Using Windows (Mostly).
To make a do file, open a text editor and start typing Stata commands,
pressing Enter at the end of each
one. Then save it as filename.do.
That's it.
Logs
Every do file should have a corresponding log file which records
what actually happened when the do file ran. If you run your do
file in batch mode, reading the log is the only way you'll get
your results. To start logging the command is:
log using filename.log, replace
where filename is the name of the
file you want Stata to use as a log. All commands and their output
will be saved in that file. The replace option
tells Stata that if a log file with that name already exists,
say from a previous attempt to run the program, it should be
replaced by the current log.
Note that if you do not specify the .log at
the end of the filename, Stata will save the log using its Stata
Markup and Control Language. SMCL has its uses, but it can only
be read by Stata's Viewer. If your filename ends with .log,
Stata will save the log as plain text which you can read in any
text editor.
When you are done with everything that needs to be recorded,
type
log close
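If you would rather not rely on the file extension, the text option of log using asks for a plain text log explicitly. A minimal sketch, where logdemo is a made-up file name:

```stata
* Close any log left open by an earlier run (no error if none is open)
capture log close
* Request a plain text log; Stata will name the file logdemo.log
log using logdemo, text replace
* Anything run here is recorded in the log
describe
log close
```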
Comments
Comments are bits of text included in a do file for the benefit
of human readers, not for Stata. When Stata sees the characters /* it
will ignore everything that follows until it sees */.
Comments should explain what the code is doing and why, and if
anyone else ever needs to read and understand your code, good
comments are invaluable. But the most likely beneficiary is yourself:
in six months (let alone ten years) your code might as well have
been written by someone else.
You don't need to comment every line of code--most Stata
code is fairly easy to read. But be sure to comment anything
that required some particular cleverness on your part.
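Stata also accepts two other comment styles in do files: a line that begins with an asterisk, and two slashes followed by a comment that runs to the end of the line. The fragment below is a sketch showing all three; gpm is a made-up variable, and sysuse is used only because it loads the auto data that ships with Stata:

```stata
* A line that starts with an asterisk is a comment
sysuse auto, clear   // two slashes start a comment that runs to the end of the line
/* This style can span
   several lines */
gen gpm=1/mpg   /* gallons per mile: easier to compare fuel costs */
```

The two-slash style is meant for do files rather than for commands typed interactively.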
Writing a Do File
Let's write an actual do file. Open your favorite text editor--if
you don't have one we suggest TextPad on Windows. Save the blank
document in a convenient location (perhaps your U: drive)
as stataintro.do so your editor will
know it's a Stata do file (TextPad and emacs will color it accordingly).
Then type something along the lines of:
log using stataintro.log, replace
set mem 5m
use "c:\program files\stata10\auto", clear
/* That's the Windows path--if you're using Linux replace it
with /software/stata/auto */
/* Some things you could do with this data--feel free to make
up your own */
sort foreign mpg
by foreign: l make mpg if _n==1 | _n==_N
gen greatCar=(rep78>3 & mpg>25) if rep78<.
logit greatCar foreign price
log close
Save it when you're done.
Running a Do File
To run your do file, go back to Stata. First you need to change
to the proper directory--the one where you saved the do file--using cd.
If you put it directly in your U: drive
the command would be
cd U:
Then actually run it by typing do and
then the filename.
do stataintro
Stata will assume that the filename ends with .do.
You'll then see all your results. If the do file doesn't run properly
you'll need to make changes and run it again, but read the next
section before doing so. Also open the log file in your text editor
so you can see what it contains.
Running Do files in Interactive Stata
If you are using Windows Stata or an interactive Stata session
in Linux there are some additional commands you'll want to add to
the beginning of your do file. First off, you don't want to have
to sit there and press the space bar every time the Results window
fills up and Stata says --more--.
You can prevent that by putting
set more off
at the beginning.
Then you want to make sure that whatever happened
before your do file was run doesn't cause problems. You can get
rid of any previous data in memory by adding
clear
but there could also be an open log file. One common scenario
(in fact you may be experiencing it right now) is where a do
file opens a log but crashes before closing it. The log thus
remains open. If you fix the problem with the do file and then
try to rerun it, it will crash again because it can't
open a new log. You can fix that by typing log close,
but that only works if a log file is actually open--otherwise
your do file crashes again. The solution is to use capture:
capture log close
The capture command allows your do file to proceed even if the
following log close command generates
an error because no log was open.
These three commands should probably be the first three lines
of any do file meant to run in interactive Stata:
set more off
clear
capture log close
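Put together with the logging commands, a complete do file might be laid out like the sketch below. The two commands in the middle are placeholders for your own work, and sysuse is used only because it finds the auto data that ships with Stata without needing a path:

```stata
* Housekeeping for interactive Stata
set more off
clear
capture log close
* Start a fresh plain text log
log using stataintro.log, replace

* The actual work goes here
sysuse auto
summarize mpg

* Close the log so the next run can open it again
log close
```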
Running Do files in Batch Mode
In Linux you can submit a job to Stata in batch mode. Batch
mode Stata doesn't waste CPU time drawing windows or putting
results on the screen. It simply starts up, runs your do file,
and quits when it is done without any further intervention. You
then get the results by opening the log file. To run a do file in
batch mode, type:
stata -b do filename
at the Linux command prompt. Note that if you plan to run a do
file in batch mode there's no need for the additional commands described
in the previous section (though they won't hurt). If your do file
will take more than a few minutes to run, consider submitting it
to Condor by logging into Kite and typing:
condor_stata -b do filename
The SSCC has a tremendous amount of computing power available
through our Condor flock. See An Introduction
to Condor for more information.
Windows Stata does not have a batch mode. However, you can prevent
Stata from wasting CPU time updating the Results window by putting
it in the background.
Organizing Your Research Project
Now we'll go the opposite direction: taking a research project
and breaking it into do files. Consistently following a few best
practices can save you a tremendous amount of time and headaches,
and reduce the probability of making serious mistakes.
In a typical situation you have a research question you want to
answer and some data that you think will answer it, but the data
isn't in a form that can actually answer the question--yet.
Begin with the End in Mind
The first thing you should do is figure out what form the data
will need to be in to be useful. What should an observation
represent? What variables will each observation need to contain?
The answers to these questions will most likely be determined
by the statistical techniques you plan to use. But planning it
out ahead of time will prevent you from spending time manipulating
the data in ways that don't actually end up meeting your needs.
Don't Try to do Everything at Once
Once you've got the goal clear in your mind, the last thing
you should do is sit down and write one massive do file that
gets you there in one step, only trying to run it once it's "done."
First of all this is a recipe for frustration, as the result
will most likely be a massive number of bugs. Even worse, you
may find that in order to make the early parts work you'll need
to do something in a different way than you originally thought.
You'll then have to change everything that follows.
It's far better to write a bit of code, test and debug it, then
write a little more, test and debug it. But then
you end up rerunning the old code you know is good every time
you want to test the new code that may be bad. The solution is
to break up your project into multiple do files. That way you
only need to rerun the part that you're currently working on.
Never Write your Output over your Input
Most do files you'll write will start with some input data
file, do things with it, and save the result. However you
should never have a do file save its output over its input. If
you do, you can never run that do file again because the input
it was written to process is now gone. If it turns out that the
do file contained an error, you may be reduced to asking the
consultant to restore your input
data from the SSCC's backup tapes.
Make your Workflow Reproducible
On the other hand, if you plan your workflow properly you can
recreate your entire project at will.
Start with the data as you obtained it. Your first
do file will read it in, make some changes, and save the results
in a separate file. Your second do file will read in the output
from the first do file, make further changes, and then save its
results in another separate file. Repeat until your project is
done. If your data files are large, you can delete all but the
original and the input data for the do file you're currently
working on. If you follow this procedure you can recreate everything
you've done at will just by rerunning all your do files. (It's
also a good idea to make a "readme" file
for each project with information like what order
its do files must be run in.)
This method is also extremely helpful for debugging. If you discover
a problem you can go back through your do files one by one until
you find the error, fix it, and then rerun the corrected code
for your entire project with just a few commands. It's also easy
to make changes.
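For instance, a project might end up with a short master do file whose only job is to run the others in order. This is just a sketch: the three file names are hypothetical and stand in for whatever steps your project has.

```stata
* master.do: rebuilds the whole project from the original data
* Step 1 reads the raw data and saves a cleaned copy under a new name
do step1_clean
* Step 2 reads the cleaned copy and saves the analysis data set
do step2_variables
* Step 3 reads the analysis data set and runs the models
do step3_analysis
```

Typing do master then recreates everything from the original data, and the file doubles as a record of the order the steps must run in.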
Learning More
Congratulations, you now know enough Stata to get you started.
There's a great deal we haven't covered, of course, but Stata
has excellent tools for learning more.
Your first resource is the Stata help files, which contain the
bulk of the printed documentation. To see the help for a particular
command type help command,
e.g.
help egen
You'll get a syntax diagram, a brief explanation of the various
options, and even examples.
However, you'll very often know what you want to do but not the
name of the command that will do it. Then findit is
your best bet. For example, suppose you want to do something
with Heckman selection models. If you type
findit heckman
you'll get a tremendous amount of information. First Stata will
search the help files and point out that there is a heckman command,
along with related commands like treatreg.
Then it will search the Frequently Asked Questions files on Stata's
web site (and the large statistical web site at UCLA). Finally
it will search through the user-written programs that have appeared
in the Stata Journal, the old Stata Technical Bulletins, or in
the Boston College Statistical Software Components archive. You
can download and install these programs right from the Stata viewer
and use them immediately.
Next, the manuals are
excellent. They are available for short-term checkout in the
CDE Library or for reference in the 4218 computer lab and can
be purchased through Stata's
"GradPlan" at a reasonable price and
great speed. The User's Guide is the first place to look for
general concepts, but the Reference books are the place to go
for help using specific commands and estimators.
The SSCC's
publication collection has a large section on Stata, including
general guides like this one, An Introduction
to Stata Graphics or Programming
in Stata, plus discussions of specific topics like Bootstrapping
in Stata or Using Stata Graphs in Documents.
We also offer classes on Stata each semester--see the training
web page for details and to register.
Finally, the SSCC consultant is
available to help. We cannot write your Stata programs for you.
But we will be more than happy to help with planning your project,
figuring out the commands that will make your program
work, and of course finding and fixing bugs.
|