Stata for Students: Creating Variables

This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.

In this article you'll learn how to create new variables and change existing variables.

Assuming you created an SFS folder while reading Managing Stata Files, go to that folder and create a new do file called newvars.do. Start with the usual setting up (see Doing Your Work Using Do Files):

capture log close
log using newvars.log, replace

clear all
set more off

use gss_sample

// do work here

log close

Generate and Replace

You create a new variable in Stata using the generate command, usually abbreviated gen. You can change the value of an existing variable using replace. Since replace can destroy data, it has no abbreviation. The basic syntax is the same for both commands:

gen variable=something
replace variable=something

The something you're setting the variable to will be the result of some math, but it can be really simple math, like a single number.

The gen and replace commands will often have if conditions. With gen, an observation that doesn't meet the if condition will not get a value for the new variable—it will be missing instead. With replace, an observation that doesn't meet the if condition is left unchanged.
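
As a minimal sketch of how if conditions behave with each command, the following assumes the GSS sample is loaded. The variable name hs_grad is hypothetical and is not used elsewhere in this article:

```stata
// hs_grad is a hypothetical name used only for illustration
gen hs_grad=1 if educ>=12 & educ<.   // 1 for 12 or more years; missing for everyone else
replace hs_grad=0 if educ<12         // changes only those with fewer than 12 years
tab hs_grad, missing                 // respondents with missing educ are still missing
```

The gen command leaves hs_grad missing for everyone who fails its condition, and the replace command then fills in only the observations that meet its own condition.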

Recoding with Generate and Replace

Let's turn the educ variable ("HIGHEST YEAR OF SCHOOL COMPLETED") into a categorical variable edu_cat, with the categories "Less than High School", "High School", "Some College", "Bachelors", and "Advanced". You'll need an initial gen command to create the new variable and handle one of the categories, and a replace command for each of the remaining categories. The if conditions ensure that each respondent gets the right value of edu_cat based on their value of educ.

gen edu_cat=1 if educ<12
replace edu_cat=2 if educ==12
replace edu_cat=3 if educ>12 & educ<16
replace edu_cat=4 if educ==16
replace edu_cat=5 if educ>16 & educ<.

If educ is missing then edu_cat will be missing because a missing value does not meet any of the if conditions in these commands.
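
One way to check a recoding like this is to cross-tabulate the original variable against the new one, including missing values. This is a suggested check, not part of the original do file:

```stata
// each value of educ should appear in exactly one column of edu_cat,
// and missing values of educ should appear only in the missing column
tab educ edu_cat, missing
```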

The Recode Command

You can also do this task using the recode command, which is easier to use but not as flexible as gen and replace. With recode you specify a list of rules in the form (old values=new value). The old values can be a single number, a list of numbers, or a range of numbers which you describe with start/end:

recode educ (0/11=1) (12=2) (13 14 15=3) (16=4) (17/20=5) ///
, gen(edu_cat2)

The gen option tells recode to create a new variable (edu_cat2) to store the results. If you don't include a gen option, recode will change the original variable. edu_cat2 will be missing if educ is missing because none of the recode rules say to change missing values to anything else.
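
If you don't know the smallest and largest values in advance, recode accepts the keywords min and max as the ends of a range. The following is a sketch; the variable name edu_cat3 is hypothetical, and the comparison assumes edu_cat2 was created by the command above:

```stata
// min and max refer to the smallest and largest values of educ in the data
recode educ (min/11=1) (12=2) (13 14 15=3) (16=4) (17/max=5) ///
, gen(edu_cat3)

// if educ really does run from 0 to 20, the two versions should agree,
// so every observation should fall on the diagonal of this table
tab edu_cat2 edu_cat3, missing
```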

Labels

Variable labels can tell you more about the variable itself, like the actual question asked. You can set them with the label variable command:

label variable edu_cat "Education Category"

Value labels tell you what the individual values of the variable mean. To set them, you first define the labels and then apply them to a variable:

label define edcats 1 "Less than HS" 2 "HS" 3 "Some College" ///
4 "Bachelors" 5 "Advanced"
label values edu_cat edcats

If you look in the data browser at the edu_cat variable after running these commands, you'll see the text labels rather than the raw numbers.
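
You can also inspect labels without opening the data browser. The following sketch assumes the label commands above have been run:

```stata
label list edcats        // show the number-to-text mapping stored in edcats
describe edu_cat         // shows the variable label and the attached value label
tab edu_cat              // frequencies displayed with the text labels
tab edu_cat, nolabel     // the same table with the underlying numbers
```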

Rename

You can change the name of a variable with the rename command. Changing meaningless variable names (like Q26 for "answer to question 26") to descriptive variable names can make it much easier to keep track of your variables. The GSS uses descriptive variable names, which is good, but they're very short and sometimes cryptic. For example, age of the respondent at the time their first child was born is agekdbrn. You could change it with:

rename agekdbrn age_at_1st_birth

Indicator (Binary) Variables

A variable can be set to the result of a condition. If the condition is true the variable will get a 1, and if it is false the variable will get a 0. This makes it very easy to create indicator or binary variables, which tell you if something is true or not.

Consider the questions bigbang ("THE UNIVERSE BEGAN WITH A HUGE EXPLOSION"), electron ("ELECTRONS ARE SMALLER THAN ATOMS"), and laser ("LASERS WORK BY FOCUSING SOUND WAVES"). They are true/false questions, with "true" coded as 1 and "false" coded as 2. They are designed to measure the scientific knowledge of the respondent. The correct answer to the first two questions is "true," but the correct answer to the third question is "false" (lasers are coherent light, not focused sound). Many respondents have missing values for these questions.

You can create corresponding indicator variables for "respondent got this question right" with the following commands:

gen bigbang_right=(bigbang==1) if bigbang<.
gen electron_right=(electron==1) if electron<.
gen laser_right=(laser==2) if laser<.

The new bigbang_right variable will get a 1 if bigbang is 1 (i.e. the condition (bigbang==1) is true). It will get a 0 otherwise, unless bigbang is missing. In that case the if bigbang<. condition takes over and says bigbang_right should be missing. Note that without that if condition, respondents with a missing value for bigbang would get 0 for bigbang_right, as if they had answered the question and gotten it wrong.

The variable names make these variables easy to understand: if bigbang_right is 1, or true, that means "Yes, this person got the big bang question right" while if bigbang_right is 0, or false, that means "No, this person did not get the big bang question right."
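
To see why the if condition matters, the following sketch builds a version of the indicator without it and compares the two. The name bigbang_naive is hypothetical and used only for this comparison:

```stata
// without the if condition, missing answers are counted as wrong
gen bigbang_naive=(bigbang==1)

// compare both versions against the original answers, including missing values
tab bigbang bigbang_right, missing
tab bigbang bigbang_naive, missing

drop bigbang_naive   // remove the illustration variable
```

In the second table, the respondents with missing values for bigbang appear under 0 rather than under missing.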

Creating a Scale (Index)

Next create a scale or index that measures the respondent's overall scientific knowledge:

gen sci_know=bigbang_right+electron_right+laser_right

The resulting scale will be the number of questions the respondent got right. It will be missing if the respondent skipped any of the three questions, because a sum that includes a missing value is itself missing. That is what we want: again, we do not want to treat not answering a question the same as answering it but getting it wrong.
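
To see how many respondents this affects, you can count how many of the three indicators are nonmissing for each respondent. This sketch uses the rownonmiss() function of the egen command; the name n_answered is hypothetical:

```stata
// number of the three questions each respondent actually answered (0 to 3)
egen n_answered=rownonmiss(bigbang_right electron_right laser_right)

// sci_know should be nonmissing only in the row where n_answered is 3
tab n_answered sci_know, missing
```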

Variables Based on Statistics

The egen ("extended generate") command lets you create variables containing statistics based on your data. For example:

egen mean_sci_know=mean(sci_know)

This creates a variable containing the mean of sci_know. To see all the things egen can do, type help egen.
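
Variables created by egen can be used in later calculations. As a sketch, the following standardizes sci_know using the mean created above and a standard deviation from egen's sd() function. The names sd_sci_know and sci_know_z are hypothetical:

```stata
egen sd_sci_know=sd(sci_know)                        // same value in every observation
gen sci_know_z=(sci_know-mean_sci_know)/sd_sci_know  // missing where sci_know is missing
summarize sci_know_z                                 // mean should be near 0, sd near 1
```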

Statistics for Groups

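If you want to calculate statistics for groups rather than the entire data set, use by to tell Stata to run egen separately for each group. The by prefix needs the data to be sorted by the grouping variable, so the example uses bysort, which sorts and then runs the command by group: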

bysort edu_cat: egen mean_sci_know_edu=mean(sci_know)

This calculates the mean of sci_know for each education category. To see the results, we'll jump ahead a bit and use the tab command with the sum option.

tab edu_cat, sum(mean_sci_know_edu)

The results show that the average of sci_know is higher at higher levels of education, as we'd expect. If you just wanted to see those means, not store them in a variable, you could run:

tab edu_cat, sum(sci_know)

This is an example of Summary Statistics for One Quantitative Variable over One Categorical Variable.
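
The by prefix works with other egen functions as well. As a sketch, the following counts how many respondents in each education category have a nonmissing value of sci_know. It is written with an explicit sort followed by by, which is equivalent to a single bysort. The name n_sci_know_edu is hypothetical:

```stata
sort edu_cat                                      // by requires data sorted by the group variable
by edu_cat: egen n_sci_know_edu=count(sci_know)   // nonmissing sci_know values in each group
tab edu_cat, sum(n_sci_know_edu)
```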

Complete Do File

The following is a complete do file for this section:

capture log close
log using newvars.log, replace

clear all
set more off

use gss_sample

gen edu_cat=1 if educ<12
replace edu_cat=2 if educ==12
replace edu_cat=3 if educ>12 & educ<16
replace edu_cat=4 if educ==16
replace edu_cat=5 if educ>16 & educ<.

recode educ (0/11=1) (12=2) (13 14 15=3) (16=4) (17/20=5) ///
, gen(edu_cat2)

label variable edu_cat "Education Category"

label define edcats 1 "Less than HS" 2 "HS" 3 "Some College" ///
4 "Bachelors" 5 "Advanced"
label values edu_cat edcats

rename agekdbrn age_at_1st_birth

gen bigbang_right=(bigbang==1) if bigbang<.
gen electron_right=(electron==1) if electron<.
gen laser_right=(laser==2) if laser<.

gen sci_know=bigbang_right+electron_right+laser_right

egen mean_sci_know=mean(sci_know)

bysort edu_cat: egen mean_sci_know_edu=mean(sci_know)

tab edu_cat, sum(mean_sci_know_edu)
tab edu_cat, sum(sci_know)

log close

Last Revised: 7/18/2016