Stata for Researchers: Working With Data

This is part four of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.

Now that you understand Stata's basic syntax, you're ready to start working with your data. This article will teach you how to make new variables, modify existing variables, and create labels.

Generate and Replace

The basic commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy information, has no abbreviation). gen creates new variables; replace changes the values of existing variables. Their core syntax is identical:

gen variable=expression

or

replace variable=expression

where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. Stata has a large library of functions you can use in gen and replace expressions, far too many for us to cover them all (though we'll introduce some as we go). For a full list, type help functions—we'll talk about learning from the help files in Learning More.

If an expression includes a missing value at any point, the result is missing. Usually this is exactly what you'd expect and want.

You should never change data interactively, so before proceeding create a do file. Open the do file editor (type doedit or click on the button that looks like a pencil writing in a notebook). Immediately click File, Save as... and save the do file as data1 (Stata will fill in .do at the end). Then start your do file with the following commands, as described in the previous section:

clear all
capture log close
set more off
log using data.log, replace

use auto

//real work goes here

save autoVersion2,replace
log close

Now you're ready to go to work. The work itself will go after use auto and before log close.

The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to 2015 dollars. To do so you need to multiply the prices by a conversion factor which is the CPI in 2015 divided by the CPI in 1978, or about 3.6. The code will be:

gen price2015=price*3.6

You can now compare them with:

l make price*

Add these lines to your do file, then click the Do button on the far right (the one that looks like sheet of paper with a little "play" symbol in the corner) or press Ctrl-d. Switch to the main Stata window to see the results.

But let's be a little more precise and use 3.57 as the conversion factor. Go back to the do file and change the line:

gen price2015=price*3.6

to:

gen price2015=price*3.57

and run the do file again. The do file starts by clearing the previous data set from memory and loading the original from disk, so there's no need to "fix" the old version of price2015. It's simply created again the way we now want it.

Having both a price and a price2015 variable allowed us to compare their values and check for plausibility. But if you only want to work with 2015 dollars and are confident you've got the formula right, you can use the replace command to change the existing price variable instead of creating a new one:

replace price=price*3.57

Run this version and you'll get the message (74 real changes made). Given that the data set has 74 observations, this tells us all of them were changed as we'd expect. Once you start including if conditions, how many observations were actually changed can be very useful information.

If a gen command has an if condition, the resulting variable will (and must) still exist for all observations. However it will be assigned a missing value for observations where the if condition is not true. If a replace command has an if condition, observations where the if condition is not true will be left unchanged.

Suppose you wanted to collapse the five-point scale of the rep78 variable into a three-point scale. Add the following code to your do file to do so:

gen rep3=1 if rep78<3
replace rep3=2 if rep78==3
replace rep3=3 if rep78>3 & rep78<.

The first line creates the new variable rep3, but only sets it to one for cases where rep78 is less than three. The others get missing. The second line changes some of those missings to twos, and the third changes more of them to threes. Note how the third line specifically excludes observations where rep78 is missing. What will the value of rep3 be for those cases? Missing, because it was never set to anything else. Observations where rep78 is missing were implicitly or explicitly excluded from all three commands. (If you forgot to exclude missing values from the last command, then rep3 would be three for cars where rep78 is missing, an all-too-common mistake.)

Recode

The recode command gives you an alternative way of creating rep3. It is designed solely for recoding tasks and is much less flexible than gen and replace. In fact, anything recode can do can also be done with gen and replace, so learning it is optional. But it can do some things more easily. The syntax is:

recode var (rule 1) (rule 2) (more rules as needed...), gen(newvar)

The gen option at the end is not required—if it's not there then the original variable will be changed rather than creating a new variable with the new values. You can also have recode work on a list of variables, recoding them all in the same way.

The core of the recode command is a list of rules, in parentheses, that tell it how a variable is to be recoded. They take the form (inputValue=outputValue). The inputValue can be a single number, a list of numbers, or a range of numbers specified with start/end. outputValue will always be a single number. Anything not covered by a rule is left unchanged. Here's a recode version of converting rep78 to a three-point scale:

recode rep78 (1 2=1) (3=2) (4/5=3), gen(rep3b)

(The only reason for listing 1 and 2 but giving a range for 4 through 5 was to demonstrate both styles.) Missing values required no special handling: since missing was not listed in the input values of any rule, observations with missing values remain unchanged.

If you did everything correctly, rep3 and rep3b will be identical. Check that with an assert:

assert rep3==rep3b

Run the do file to find out how you did.

Creating Indicator Variables

In creating indicator variables, you can take advantage of the fact that Stata treats true as one and false as zero by setting a variable equal to a condition. Consider:

gen gasGuzzler=(mpg<20)

(The parentheses are optional, but make it easier to read.) This creates an indicator variable called gasGuzzler which is one (true) for cars where mpg is less than twenty and zero (false) where mpg is greater than or equal to twenty. You can see the effect with:

l make mpg if gasGuzzler

We know that no car has a missing value for mpg but, if any did, the above code would assign it a zero for gasGuzzler as if it were known to have good gas mileage. gasGuzzler should be missing for such cases, which you can do with:

gen gasGuzzler=(mpg<20) if mpg<.

Egen

The egen command, short for extended generate, gives you access to another library of functions—type help egen for a full list. egen functions tend to be more complex, and often work across observations.

Suppose you wanted to find the mean value of mpg and store it in a variable. egen has a mean function which will give you exactly what you want:

egen meanMPG=mean(mpg)

The mean() function finds the mean of a column. To find the mean of a row, or, more likely, part of a row, use rowmean():

egen rm=rowmean(mpg rep78)

For each car, rm will contain the mean of that car's mpg and rep78, not that that's likely to be a useful quantity. A more typical use of rowmean() is to construct a respondent's mean response to a group of questions.

The egen functions generally handle missing values by calculating their result across whatever data are available. Thus for observations where rep78 is missing, rm is just mpg. Most of the time that's what you want: if you have 10,000 observations and 1 of them has a missing value you wouldn't want Stata to tell you it's impossible to calculate a mean. However, it can be problematic: if you use rowmean() to calculate mean responses to a group of questions but some people didn't answer all the questions, the result will be the same as if you had filled in all the missing values with the mean—a very bad idea.

With egen, what follows the equals sign is a single function and not a mathematical expression. If you wanted to set a variable to one-half of a mean you'd have to first use egen to calculate the mean and then use a replace command to divide it by 2.

Here are a few of the most commonly used functions in the egen library:

Name Description
min() Minimum value
max() Maximum value
mean() Mean
median() Median
sd() Standard Deviation
total() Total

All of these functions act across observations. The parentheses will usually contain a single variable for the function to act on, but can contain a mathematical expression instead. These functions also have row equivalents (rowmin, rowmax, etc.) that do the same thing but across variables on a single row. There are plenty of other useful egen functions, such as std (create a standardized version of a variable), group (create a group identifier based on the values of one or more categorical variables), or even mtr (marginal tax rate for a married couple in the US with a given amount of income in a given year). You can get a complete list by typing help egen, and you should plan on reading through it some time early in your Stata career.

Drop and Keep

The drop command allows you to remove either variables or observations from your data set. If you give it a varlist, it will remove those variables:

drop rep3b

removes the variable rep3b from your data set.

If you give it an if condition, drop will remove all observations where that condition is true:

drop if gasGuzzler

The keep command works in the same way, but in the opposite sense. keep rep3b would remove all variables except rep3b, while keep if gasGuzzler would remove all observations that are not gas guzzlers.

Rename

You can rename a variable by typing:

rename oldName newName

Renaming variables with gibberish names (H2V06 and the like) may take a bit of time, but will save you time in the end.

Variable names must be one word with no spaces. However, you can use either capital letters or underscores (_) to mark word boundaries. A variable name like numinhh looks like gibberish, but if you put it in the form numInHH or num_in_hh then the reader has a fighting chance of realizing it means "number in household."

The rename command also has the ability to rename large numbers of variables based on patterns. Type help rename for more information.

The variable name rep3 doesn't convey much information, so let's change it:

rename rep3 repairRecord

Labels

Labels allow you to convey more information about your data. You only have to type them once, so they can be as long as you want. Labels can be applied to variables or to their values (or entire data sets, which we won't discuss).

This data set already has a good set of variable labels, as you can see in the Variables window. The only one that might be confusing is the label on foreign, so we'll change it using the label variable command. The syntax to set a variable label is:

label variable var "label"

So type:

label variable foreign "Car Origin"

Look at the Variables window again to see the results.

Next let's explore value labels by labeling the values of the repairRecord (formerly known as rep3, the new variable we recoded to collapse rep78 from a five point scale to a three point scale). Value labels are a mapping from a set of integers to a set of descriptions, so the first step is to create the map. To do so, use the label define command:

label define mapName value1 "label1" value2 "label2"...

Thus:

label define rep 1 "Bad" 2"Average" 3"Good"

Then tell Stata to label the values of the repairRecord variable using the rep mapping you just defined. The syntax is:

label values variable map

And thus:

label values repairRecord rep

To see the results, add:

list repairRecord

Once a map is defined you can apply it to any number of variables: just replace the single variable in the label values command above with a list of variables. Suppose you're working with survey data and your variables include the gender of the respondent, the gender of the respondent's spouse, and the genders of all the respondent's children. You could define just one map called gender and then use it to label the values of all the gender variables.

Two final commands for value labels: label dir gives you a list of all the defined labels, and label list tells you what they mean.

Variable Types and Precision

Stata can store numbers in five different types of variables. byte, int and long are all integers of various sizes. The smallest, byte, can only store numbers below 100 but takes up very little memory, making it ideal for indicator and categorical variables. int can store numbers up to about 32,000 and long up to about two billion. For numbers with fractions, your choices are float (the default) and double. Both can store very large numbers, but their precision is limited: a float only has about seven digits of accuracy and a double sixteen. For details type help data_types.

While social scientists rarely have seven meaningful digits in their data, keep in mind that identifiers are just big numbers as far as Stata is concerned. For example, you probably think of your UW ID as a string of ten small integers ("nine, zero, two...") but to Stata it's a single, very large number. If you tried to store UW IDs as the default float, they'd be rounded since a float can't store ten digits accurately (which would defeat the purpose of storing an ID). Such IDs should be stored using doubles, longs or strings.

The type of a variable is set when it is created. To create a variable that is not a float, put the desired type right after gen or egen and before the variable name:

gen byte highMPG=(mpg>25)

If you declare that a variable is an integer (byte, int or long) but the expression you set it equal to contains fractions, the fractional part will be truncated, not rounded. There is also a round() function if you need it.

Much of the time it's not worth worrying about finding the most efficient numeric types for your variables. However, if your data set is large, using small types like byte where possible can save a lot of memory and disk space. The compress command will look for variables that can be stored in smaller types without losing precision, and will change them automatically.

Strings

Strings are variables that contain text rather than numeric values. It's quite possible for that text to be made up of numbers, but Stata will not try to evaluate them. You can recognize a string because it will have quotes around it:

gen x1="123"

makes x1 a string, and is completely different from

gen x2=123

For example, you can't add x1 and x2. You can write if x1=="123" and even if x1>"123" but the latter will be evaluated according to alphabetical order, not numeric.

Stata noticed that you were setting x1 equal to a string, and thus made x1 a string variable automatically. However:

replace x1=123

or

replace x2="123"

will not work because you can't change a variable from string to numeric or vice versa—but you can make it look like you did.

Changing the Type of a Variable

Suppose you needed to do some math with the numbers contained in x1. Right now you can't, because x1 is a string. Here's how you can change that:

gen temp=real(x1)
drop x1
rename temp x1

The real() function takes one argument, a string, and returns that string converted to a number. If the string contains anything but numbers real() will return missing. Having stored the numbers in x1 as temp, you then drop x1 and make temp the new x1. x1 is now a numeric variable. You can turn numeric variables into strings using the same process—just replace the real() function with the string() function.

Exercises

For the exercises that use the automobile data set, make sure your do file loads the original data set.

  1. Using the automobile data set, suppose the cost of manufacturing a car is the sum of the following:
    • $1.50 per pound of weight
    • $0.25 per pound to ship if it is foreign
    • $100 if its rep78 is 5 (presumably to hire better engineers)

    Calculate the profit (price minus cost) from selling each car. (Solution)

  2. Consider interviews.dta. It contains the month and year in which each subject was born and the month and year in which that subject was interviewed. Find the age of the subject at the time of the interview in months. Then find it in whole years (i.e. what the person would say if you asked "How old are you?"). Don't worry about days (or if you prefer, assume that interviews always occur later in the month than birthdays). For extra credit, read up on Stata dates and repeat the process using them. (Solution)
  3. Going back to the automobile data set, the make variable is comprised of the manufacturer of the car followed by the name of the car. Create a new variable containing just the manufacturer.

    This exercise will probably require you to type help functions and/or help egen and look through the list of functions available. In fact that's kind of the point: finding functions that will do what you need to do is a big part of Stata programming. Learning More has some tips for understanding the help files. (Solution)

  4. Now consider statecounty.dta. It has a two-part identifier: state and county. Combine them into a single identifier such that county 1 in state 1 becomes 101 and county 5 in state 12 becomes 1205. Now do it again by turning them into strings first, making the code for county 1, state 1 "0101". (Solution)
  5. Use the automobile data set again. Create value labels so that when listing mpg you see "12 (Lowest MPG)" and "41 (Highest MPG)" for the cars with the lowest and highest values of mpg respectively. (Solution)

Next: Statistics

Previous: Usage and Syntax

Last Revised: 12/17/2015