This is part three of the Stata for Researchers series. For a list of topics covered by this series, see the Introduction. If you're new to Stata we highly recommend reading the articles in order.
Now that you understand Stata's basic syntax, you're ready to start working with your data. This article will teach you how to make new variables, modify existing variables, and create labels.
The basic commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy data, has no abbreviation). gen creates new variables; replace changes the content of existing variables. Their core syntax is identical:
gen variable=expression
or
replace variable=expression
where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. Stata has a large library of functions you can use in gen and replace expressions, far too many to discuss here (though we'll introduce some as we go). For a full list, type help functions in Stata--we'll talk more about learning from the help files in Learning More.
If an expression includes missing values at any point, the result is missing. Usually this is exactly what you'd expect and want.
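As a quick sketch of this rule, the commands below use the auto data set that ships with Stata, where rep78 is missing for a few cars. The variable name repPlusOne is only an illustration.

```stata
* Load the example data set that ships with Stata
sysuse auto, clear
* rep78 is missing for some cars, so rep78+1 is missing for those same cars
gen repPlusOne=rep78+1
* List only the cars where rep78 is missing
l make rep78 repPlusOne if rep78>=.
```

Every car in the listing should show a missing value in both columns.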
The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to 2011 dollars. The conversion factor is about 3.5, so you can carry out the conversion with:
gen price2011=price*3.5
You can now compare them with
l make price price2011
But let's be a little more precise and use 3.45 as the conversion factor. You can make the change with:
replace price2011=price*3.45
Given that the data set has 74 observations, the message (74 real changes made) tells us all of them were changed. Once you start including if conditions, how many observations were actually changed can be very useful information.
If a gen command has an if condition, the resulting variable will (and must) still exist for all observations. However, it will be assigned a missing value for observations where the if condition is not true. If a replace command has an if condition, observations where the if condition is not true will be left unchanged.
Suppose you wanted to collapse the five-point scale of the rep78 variable into a three-point scale. Here's one way to do it:
gen rep3=1 if rep78<3
replace rep3=2 if rep78==3
replace rep3=3 if rep78>3 & rep78<.
The first line creates the new variable rep3, but only sets it to one for cases where rep78 is less than three. The others get missing. The second line changes some of those missings to twos, and the third changes more of them to threes. Note how the third line specifically excludes observations where rep78 is missing. What will the value of rep3 be for those cases? Missing, because it was never set to anything else. Observations where rep78 is missing were implicitly or explicitly excluded from all three commands. (If you forgot to exclude missing values from the last command, then rep3 would be three for cars where rep78 is missing, an all-too-common mistake.)
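To see the mistake described in the parenthetical, here is a sketch that builds the variable both ways and tabulates each against rep78. The name rep3bad is made up for this illustration.

```stata
sysuse auto, clear
* Correct version: missing values of rep78 are excluded from the last command
gen rep3=1 if rep78<3
replace rep3=2 if rep78==3
replace rep3=3 if rep78>3 & rep78<.
* Mistaken version: missing counts as larger than any number, so rep78>3 is true for it
gen rep3bad=1 if rep78<3
replace rep3bad=2 if rep78==3
replace rep3bad=3 if rep78>3
* Compare each result with the original variable, showing missing values too
tab rep78 rep3, missing
tab rep78 rep3bad, missing
```

In the second table the cars with a missing rep78 land in the column for three, which is exactly the error to watch for.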
The recode command gives you an alternative way of creating rep3. It is designed solely for recoding tasks and is far less flexible than gen and replace. In fact, anything recode can do can also be done with gen and replace, so learning it is optional. But it can do some things more easily. The syntax is:
recode var (rule 1) (rule 2) (more rules as needed...), gen(newvar)
The gen option at the end is not required--if it's not there then the original variable will be changed rather than creating a new variable with the new values. You can also have recode work on a list of variables, recoding them all in the same way.
The core of the recode command is a list of rules, in parentheses, that tell it how a variable is to be recoded. They take the form inputValue=outputValue. The inputValue can be a single number, a list of numbers, or a range of numbers (start/end, just like with in). outputValue will always be a single number. Anything not covered by a rule is left unchanged. Here's the recode version of converting rep78 to a three-point scale:
recode rep78 (1 2=1) (3=2) (4/5=3), gen(rep3b)
Incidentally, the only reason for listing 1 and 2 but giving a range for 4 through 5 was to demonstrate both styles. Given these data they are equivalent. Missing values required no special handling: since missing was not listed in the input values of any rule, observations with missing values remain unchanged.
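One way to confirm that the two approaches agree is to build both variables and compare them with assert. This is a sketch rather than a required step; assert says nothing if the condition holds for every observation and stops with an error otherwise.

```stata
sysuse auto, clear
* The gen/replace version
gen rep3=1 if rep78<3
replace rep3=2 if rep78==3
replace rep3=3 if rep78>3 & rep78<.
* The recode version
recode rep78 (1 2=1) (3=2) (4/5=3), gen(rep3b)
* Silent if the two variables match in every observation, including the missing ones
assert rep3==rep3b
```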
In creating indicator variables, you can take advantage of the fact that Stata treats true as one and false as zero by setting a variable equal to a condition. Consider:
gen gasGuzzler=(mpg<20)
(The parentheses are optional, but make it easier to read.) This creates an indicator variable called gasGuzzler which is one (true) for cars where mpg is less than twenty and zero (false) where mpg is greater than or equal to twenty. You can see the effect with:
l make mpg if gasGuzzler
The egen command, short for extended generate, gives you access to another library of functions--type help egen for a full list. With egen you set your new variable equal to the result of one of these functions; you cannot use them in expressions. egen functions tend to be more complex, and often work across observations.
Suppose you wanted to find the mean value of mpg and store it in a variable. egen has a mean function which will give you exactly what you want:
egen meanMPG=mean(mpg)
l make mpg meanMPG
But what if for some odd reason you wanted to create halfMeanMPG, equal to the mean divided by two? What you can't do is divide the result of the mean function by two in the egen command:
egen halfMeanMPG=mean(mpg)/2
The error message you'll get is confusing, but the real problem is that egen sets a variable equal to the result of a single function, not an expression (like mean(mpg)/2). To get the mean divided by two you'd have to type:
egen halfMeanMPG=mean(mpg)
replace halfMeanMPG=halfMeanMPG/2
The mean() function finds the mean of a column in your data set. If you need to find the mean of a row, or, more likely, part of a row, you'd use rowmean():
egen junkvar=rowmean(weight price)
For each car, junkvar will contain the mean of that car's weight and price (a meaningless number, hence the name).
The egen library includes functions for calculating most summary statistics either by row or by column. Some other particularly useful ones include total(), which calculates the total of a variable; rowtotal(), which adds up variables across a row; and group(), which will group your observations based on the values of the variable(s) you list and assign each group a number.
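The three functions just mentioned can be tried on the auto data as follows. The new variable names are made up for this sketch, and adding headroom to trunk is not meant to be a meaningful quantity.

```stata
sysuse auto, clear
* Column total: every observation gets the sum of weight over all cars
egen totalWeight=total(weight)
* Row total: each car's own headroom plus its own trunk space
egen roomSum=rowtotal(headroom trunk)
* Group number: one number per distinct value of foreign
egen originGroup=group(foreign)
l make weight totalWeight headroom trunk roomSum foreign originGroup in 1/5
```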
The egen functions generally handle missing values by calculating the result across whatever data are available. Thus while:
gen sum1=mpg+rep78
will set sum1 to missing wherever rep78 is missing, the otherwise equivalent
egen sum2=rowtotal(mpg rep78)
will set sum2 to mpg where rep78 is missing.
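To see the difference side by side, you could list just the cars where rep78 is missing:

```stata
sysuse auto, clear
gen sum1=mpg+rep78
egen sum2=rowtotal(mpg rep78)
* Look only at the cars where rep78 is missing
l make mpg rep78 sum1 sum2 if rep78>=.
```

For these cars sum1 is missing while sum2 equals mpg, because rowtotal() adds up only the values that are present.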
Stata can store numbers in five different types of variables. byte, int and long are all integers of various sizes. The smallest, byte, can only store integers from -127 to 100 but takes up very little memory, making it ideal for indicator and categorical variables. int can store numbers up to about 32,000 and long up to about two billion. For numbers with fractions, your choices are float (the default) and double. Both can store very large numbers, but their precision is limited: a float only has about seven digits of accuracy and a double sixteen. For details type help data_types.
While social scientists rarely have seven meaningful digits in their data, you need to keep in mind that identifiers are just big numbers as far as Stata is concerned. For example, you probably think of your UW ID as a string of ten small integers ("nine, zero, two...") but to Stata it's a single, very large number. If you tried to store UW IDs as the default float, they'd be rounded since a float can't store ten digits accurately (which would defeat the purpose of storing an ID). Such IDs should be stored as doubles or strings. A long also works, but only for IDs below about two billion, which rules out many ten-digit IDs.
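The rounding is easy to demonstrate. The sketch below stores a made-up ten-digit ID (9021234567, chosen only for illustration) as a float, a double and a string.

```stata
clear
set obs 1
* 9021234567 is a made-up ID used only for illustration
gen idFloat=9021234567
gen double idDouble=9021234567
gen idString="9021234567"
* Show all the digits rather than scientific notation
format idFloat idDouble %12.0f
l
```

The double and the string keep every digit, while the float version comes back as a nearby but different number.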
The type of a variable is set when it is created. To create a variable that is not a float, put the desired type right after gen or egen and before the variable name:
gen byte greatCar=(mpg>25 & rep78>3) if rep78<. | mpg<=25
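The if condition in that command deserves a second look. One reading of it: a car with mpg of 25 or less cannot be a great car whatever its repair record, so greatCar can safely be set to zero even if rep78 is missing. A car with mpg above 25 and a missing rep78 can't be classified, so it is left missing. A sketch to check that reading:

```stata
sysuse auto, clear
gen byte greatCar=(mpg>25 & rep78>3) if rep78<. | mpg<=25
* greatCar should be missing only for cars with mpg>25 and an unknown repair record
l make mpg rep78 greatCar if greatCar>=.
tab greatCar, missing
```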
If you declare that a variable is an integer (byte, int or long) but the expression you set it equal to contains fractions, the fractional part will be truncated, not rounded. There is also a round() function if you need it.
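A small sketch of the difference, using mpg/2 because it has a fractional part whenever mpg is odd. The variable names are made up for this example.

```stata
sysuse auto, clear
* Storing mpg/2 in a byte drops the fractional part
gen byte halfTrunc=mpg/2
* round() rounds to the nearest integer before the value is stored
gen byte halfRound=round(mpg/2)
l make mpg halfTrunc halfRound in 1/10
```

The two variables agree for even values of mpg and differ by one for odd values.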
Much of the time it's not worth worrying about finding the most efficient numeric types for your variables. However, if your data set is large, using small types like byte where possible can save a lot of memory and disk space. The compress command will look for variables that can be stored in smaller types without losing precision, and will change them automatically.
Strings are variables that contain text rather than numeric values. It's quite possible for that text to be made up of numbers, but Stata will not try to evaluate them. You can recognize a string because it will have quotes around it:
gen x1="123"
makes x1 a string, and is completely different from
gen x2=123
In particular, you can't add x1 and x2. You could write if x1=="123" or even if x1>"123" but the latter would be evaluated according to alphabetical order, not numeric.
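The alphabetical ordering can be surprising, so here is a small sketch. The names s1 and s2 are made up for this example, and the real() function used in the last line (covered later in this article) converts a string to a number.

```stata
clear
set obs 1
* Both variables hold text that happens to look like numbers
gen s1="9"
gen s2="123"
* As text, "9" sorts after "123" because "9" comes after "1": displays 1 (true)
display s1>s2
* As numbers, 9 is less than 123: displays 0 (false)
display real(s1)>real(s2)
```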
Stata noticed that you were setting x1 equal to a string, and thus made x1 a string variable automatically. However:
replace x1=123
or
replace x2="123"
will not work because you can't change a variable from string to numeric or vice versa (though we'll see how to achieve the same effect momentarily).
The drop command allows you to remove either variables or observations from your data set. If you give it a varlist, it will remove those variables:
drop greatCar
removes the variable greatCar from your data set.
If you give it an if condition, drop will remove observations where that condition is true:
drop if gasGuzzler
The keep command works in the same way, but in the opposite sense. keep greatCar would remove all variables except greatCar, while keep if gasGuzzler would remove all observations that are not gas guzzlers.
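Because drop and keep remove data from memory, it can be reassuring to count observations before and after. A minimal sketch:

```stata
sysuse auto, clear
gen gasGuzzler=(mpg<20)
* Number of observations before dropping
count
drop if gasGuzzler
* Number of observations left; only cars with mpg of 20 or more remain
count
```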
You can rename a variable by typing:
rename oldName newName
Renaming variables with gibberish names (H2V06 and the like) may take a bit of time, but will save you time in the end.
Variable names must be one word with no spaces. However, capitalization can make them more readable. For comparison try to interpret numinhh vs. numInHH (number in household). You can also use the underscore (_) as a space: num_in_hh.
The proper length for variable names is a matter of taste. repairRecord is very clear, but takes a while to type. rep78 is much shorter, but doesn't tell you what it means if you don't already know. In general, the more often you use a variable the shorter its name should be, both to save typing and because it will be familiar anyway.
Now that you know drop and rename, you're ready to change the types of variables. Suppose you needed to do some math with the numbers contained in x1. Right now you can't, because x1 is a string. Here's how you can change that:
gen temp=real(x1)
drop x1
rename temp x1
The real() function takes one argument, a string, and returns that string converted to a number. If the string contains anything but numbers, real() will return missing. Having stored the numbers in x1 as temp, you then drop x1 and make temp the new x1. x1 is now a numeric variable. You can turn numeric variables into strings using the same process, but replace the real() function with the string() function.
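Here is the whole round trip as a sketch, using two made-up values so you can see what happens to text that is not a clean number:

```stata
clear
set obs 2
* Two made-up values: one a clean number, one containing a letter
gen x1="123" in 1
replace x1="12a" in 2
* String to numeric
gen temp=real(x1)
drop x1
rename temp x1
* x1 is now numeric: 123 in the first observation, missing in the second
l
* Numeric back to string, using string() in place of real()
gen temp=string(x1)
drop x1
rename temp x1
```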
Labels allow you to convey more information about your data. You only have to type them once, so they can be as long as you want. Labels can be applied to variables or to their values (or entire data sets, which we won't discuss).
This data set already has a good set of variable labels. Type d to see them. The only one that might be confusing is the label on foreign, so we'll change it using the label command. The syntax to change a variable label is:
label variable var "label"
So type:
label variable foreign "Car Origin"
Type d again to see the results.
Next let's explore value labels by labeling the values of the rep78 variable. Value labels are a mapping from a set of integers to a set of descriptions, so the first step is to create the map. To do so, use the label command again, but this time the syntax is:
label define mapName value1 "label1" value2 "label2"...
Thus:
label define rep 1 "Very Bad" 2 "Bad" 3 "Average" 4 "Good" 5 "Very Good"
Then tell Stata to label the values of the rep78 variable using the rep mapping you just defined. The syntax is:
label values variable map
And thus:
label values rep78 rep
To see the results, type:
l rep78
Once a map is defined you can apply it to any number of variables. Suppose you're working with survey data and your variables include the gender of the respondent, the gender of the respondent's spouse, and the genders of all the respondent's children. You could define just one map called gender and then use it to label the values of all the gender variables.
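A sketch of that idea, with made-up variable names and a made-up coding (1 and 2) standing in for real survey data:

```stata
clear
set obs 3
* Made-up survey variables, for illustration only
gen byte genderResp=1
gen byte genderSpouse=2
* Define the map once
label define gender 1 "Female" 2 "Male"
* Apply it to each variable that uses the same coding
label values genderResp gender
label values genderSpouse gender
l
label list gender
```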
Two final commands for value labels: label dir gives you a list of all the defined labels, and label list tells you what they mean.
Once you've made changes to your data set, you'll want to save them. You can do so with the save command:
save dataset, replace
The replace option tells Stata that if the file already exists it can replace it with the new version.
However, you should never write a do file that saves its output over its input. If you do, 1) you can never run that do file in the same way again, and 2) if you made a mistake you can't undo it. We'll talk more about this in Do Files and Project Management.
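In practice that means loading one file and saving under a different name. A minimal sketch, where autoPrices is a hypothetical output file name:

```stata
* Input: the example data set that ships with Stata
sysuse auto, clear
gen price2011=price*3.45
* Output goes to a new file in the current directory; the input is left untouched
save autoPrices, replace
```

Here the replace option only ever overwrites earlier output from the same commands, so they can be rerun from scratch at any time.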
If you carried out all the commands listed in this article you'll have made quite a mess of the sample data set. Before doing the exercises reload it with:
sysuse auto, clear
Suppose the cost of manufacturing a car is the sum of the following:
Calculate the profit (price minus cost) from selling each car. (Solution)
Going back to the automobile data set (sysuse auto, clear), the make variable consists of the manufacturer of the car followed by the name of the car. Create a new variable containing just the manufacturer. Next turn it into a categorical variable (i.e. a number for each manufacturer). Finally obtain a set of dummies based on that categorical variable. Now extract the manufacturer from make again using a different method. Use the smallest possible variable types at each step.
This exercise will probably require you to type help functions and/or help egen and look through the list of functions available. In fact that's kind of the point: finding functions that will do what you need to do is a big part of Stata programming. Learning More has some tips for understanding the help files. (Solution)
Last Revised: 8/19/2011
