*This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.*

Descriptive statistics give you a basic understanding of one or more variables and how they relate to each other.

For a variable that describes categories (like sex or race) rather than quantities (like income), frequencies tell you how many observations are in each category. These are examples of univariate statistics, or statistics that describe a single variable.

Categorical variables are also sometimes called factor variables. Indicator variables (also called binary or dummy variables) are just categorical variables with two categories. Frequency tables for a single variable are sometimes called one-way tables.

For a variable that describes quantities (like income), the mean tells you what the expected value of the variable is, and the standard deviation tells you how much it varies. However, the median and percentiles often give you a better sense of how the variable is distributed, especially for variables whose distributions are not symmetric (like income, which often has a few very high values). These are also univariate statistics.

Quantitative variables are often called continuous variables. Means are often called averages, and variance is just the standard deviation squared. The median is also the 50th percentile.

For two categorical variables, frequencies tell you how many observations fall in each combination of the two categorical variables (like black women or Hispanic men) and can give you a sense of the relationship between the two variables. These are examples of bivariate statistics, or statistics that describe the joint distribution of the two variables.

Tables of frequencies for two variables are often called two-way tables, contingency tables, or crosstabs.

For a quantitative variable and a categorical variable, the mean value of the quantitative variable for those observations that fall in each category of the categorical variable can give you a sense of how the two variables are related. Often the question of interest is whether the distribution of the quantitative variable is different for different categories. These are also examples of bivariate statistics.

For three or more categorical variables, frequencies will tell you how many observations fall in each combination of the variables and give you a sense of their relationships just like they did with two categorical variables. These are examples of multivariate statistics.

For a quantitative variable and two or more categorical variables, the mean value of the quantitative variable for those observations in each combination of the categorical variables can give you a sense of how the variables are related, just as it did with a quantitative variable and one categorical variable. These are examples of multivariate statistics.

All of these tasks can be carried out using just two Stata commands: tabulate (or tab) and summarize (or sum). Getting them to do all these things is simply a matter of applying Stata syntax, so if you've read How Stata Commands Work, this section will have no surprises for you.

Some commonly used options can change what the tables produced by tab look like. They are described later in this article: adding percentages, showing values instead of value labels, and including missing values.

If you plan to carry out the examples in this article, make sure you've downloaded the GSS sample to your U:\SFS folder as described in Managing Stata Files. Then create a do file called descriptives.do in that folder as described in Doing Your Work Using Do Files and start with the following code:

capture log close

log using descriptives.log, replace

clear all

set more off

use gss_sample

// do work here

log close

If you plan on applying what you learn directly to your homework, create a similar do file but have it load the data set used for your assignment.

The tabulate command, or just tab, creates tables of frequencies. To have it give you frequencies for a single categorical variable simply tell it which variable you want it to act on:

tab sex

This produces the following output:

RESPONDENTS |
        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        110       43.31       43.31
     female |        144       56.69      100.00
------------+-----------------------------------
      Total |        254      100.00

This tells us that in the GSS sample, 110 of the respondents are male (43.31%) and 144 of the respondents are female (56.69%), for a total of 254 respondents.

The summarize command, or just sum, creates tables of summary statistics. To have it give you summary statistics for a single variable, simply tell it which variable you want it to act on:

sum educ

This produces the following output:

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        educ |       254    13.38583     3.336343         0         20

This tells us that the mean years of education of the respondents in our GSS sample is 13.4 years.

If you want to see percentiles, add the detail option:

sum educ, detail

This produces the following output:

              HIGHEST YEAR OF SCHOOL COMPLETED
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5              0
 5%            7              2
10%           10              5       Obs                 254
25%           12              6       Sum of Wgt.         254
50%           13                      Mean           13.38583
                        Largest       Std. Dev.      3.336343
75%           16             20
90%           18             20       Variance       11.13118
95%           19             20       Skewness      -.3988663
99%           20             20       Kurtosis       3.899598

The median, 13, is in the table as the 50th percentile. Note how the 25th percentile is 12, just one year lower than the median, while the 75th percentile is three years higher at 16. This shows that the distribution of educ is asymmetric. A histogram (hist educ or hist educ, discrete) would be a good tool to understand its distribution.
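The numbers in this table can also be retrieved as stored results, which makes it easy to check the asymmetry directly. The following is a minimal sketch rather than part of the original example; it assumes gss_sample is already loaded as in descriptives.do, and it relies on the r(p25), r(p50), r(p75), r(sd), and r(Var) results that summarize leaves behind when run with the detail option:

```stata
* Assumes gss_sample is already loaded, as in descriptives.do
quietly summarize educ, detail

* Distance from the median to each quartile
display "Median minus 25th percentile: " r(p50)-r(p25)
display "75th percentile minus median: " r(p75)-r(p50)

* Variance is just the standard deviation squared
display "Std. Dev. squared: " r(sd)^2
display "Variance:          " r(Var)
```

The two gaps should agree with the percentiles in the table above, and the last two lines should display the same number.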

To have tab give you the joint distribution of two categorical variables, tell it which two variables you want it to act on:

tab class sex

This produces the following output:

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24
WORKING CLASS |        57         64 |       121
 MIDDLE CLASS |        37         60 |        97
  UPPER CLASS |         5          5 |        10
--------------+----------------------+----------
        Total |       109        143 |       252

This tells us that in the GSS sample 57 of the respondents are males who consider themselves working class while 60 are females who consider themselves middle class. If you want percentages, see Adding Percentages to tab Output.

The first variable you list will be placed in rows and the second in columns. The table will usually be easier to read if the variable with the most unique values is listed first. On the other hand, if you're thinking of the two variables as a dependent variable and an independent variable, the dependent variable is usually listed first so it goes in the rows.

If you start with a tab command and then add the sum() option, with the name of a continuous variable in the parentheses, Stata will add summary statistics for that variable to each cell of the table:

tab class, sum(educ)

Gives:

 SUBJECTIVE |
      CLASS |    Summary of HIGHEST YEAR OF SCHOOL
IDENTIFICAT |                COMPLETED
        ION |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  LOWER CLA |        11.5   3.5630959          24
  WORKING C |   12.570248   3.1247038         121
  MIDDLE CL |    14.71134   3.0171688          97
  UPPER CLA |        15.2   3.4253954          10
------------+------------------------------------
      Total |   13.396825   3.3473052         252

This tells us that the 24 respondents who consider themselves lower class have an average of 11.5 years of education, while the 10 respondents who consider themselves upper class have an average of 15.2 years of education. Examination of the table as a whole suggests a relationship between formal education and class, which should surprise no one.
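The sum() option reports only the mean, standard deviation, and frequency for each category. If the question is whether the whole distribution differs across categories, one possibility is to run sum with the detail option separately for each category using the by prefix. This is a sketch under the assumption that gss_sample is already loaded as in descriptives.do; note that by also forms groups for the missing values of class:

```stata
* Assumes gss_sample is already loaded, as in descriptives.do
* Full summary of educ, including percentiles, for each value of class
bysort class: sum educ, detail
```

The output is longer than the tab table, but it lets you compare medians and percentiles across classes rather than just means.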

You cannot give tab a list of three variables to act on. However, you can use *by* to create separate tables for each value of a categorical variable:

bysort sex: tab class race

This produces:

-----------------------------------------------------------------
-> sex = male

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |        RACE OF RESPONDENT
            N |     white      black      other |     Total
--------------+---------------------------------+----------
  LOWER CLASS |         6          1          3 |        10
WORKING CLASS |        41          8          8 |        57
 MIDDLE CLASS |        30          3          4 |        37
  UPPER CLASS |         3          2          0 |         5
--------------+---------------------------------+----------
        Total |        80         14         15 |       109

-----------------------------------------------------------------
-> sex = female

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |        RACE OF RESPONDENT
            N |     white      black      other |     Total
--------------+---------------------------------+----------
  LOWER CLASS |         7          5          2 |        14
WORKING CLASS |        49         12          3 |        64
 MIDDLE CLASS |        45         10          5 |        60
  UPPER CLASS |         3          1          1 |         5
--------------+---------------------------------+----------
        Total |       104         28         11 |       143

This tells us that the sample includes 6 respondents who are male, white, and identify as lower class, as well as 10 respondents who are female, black, and identify as middle class.

You can add as many variables as you need to the *by* part of the command:

bysort sex divorce: tab class race

However, the amount of output you'll get can become cumbersome quickly.

The sum() option works exactly the same for tables with two or more categorical variables as it does with one categorical variable.

tab class sex, sum(educ)

Produces:

Means, Standard Deviations and Frequencies of HIGHEST YEAR OF SCHOOL COMPLETED

SUBJECTIVE |
     CLASS |
IDENTIFICA |    RESPONDENTS SEX
      TION |      male     female |     Total
-----------+----------------------+----------
 LOWER CLA |      11.7  11.357143 |      11.5
           | 4.5227818   2.871803 | 3.5630959
           |        10         14 |        24
-----------+----------------------+----------
 WORKING C | 12.719298    12.4375 | 12.570248
           | 3.1495713  3.1212279 | 3.1247038
           |        57         64 |       121
-----------+----------------------+----------
 MIDDLE CL | 14.513514  14.833333 |  14.71134
           | 3.2966449  2.8532863 | 3.0171688
           |        37         60 |        97
-----------+----------------------+----------
 UPPER CLA |      15.2       15.2 |      15.2
           | 3.0331502  4.1472883 | 3.4253954
           |         5          5 |        10
-----------+----------------------+----------
     Total | 13.348624  13.433566 | 13.396825
           | 3.4490368  3.2793838 | 3.3473052
           |       109        143 |       252

This tells us that the 37 males in our sample who identify as middle class have an average of 14.5 years of education, while the 60 females who identify as middle class have an average of 14.8 years of education.

If you need to consider three or more categorical variables, use *by* as described above.
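As an illustration, the following sketch produces one table of education by class and sex for each value of race. It assumes gss_sample is already loaded as in descriptives.do. The second command uses the nostandard and nofreq options of tab with sum(), which are not covered elsewhere in this article, to keep only the means and shorten the output:

```stata
* Assumes gss_sample is already loaded, as in descriptives.do
* One table of education by class and sex for each value of race
bysort race: tab class sex, sum(educ)

* Same tables, but showing only the mean in each cell
bysort race: tab class sex, sum(educ) nostandard nofreq
```

With three categories of race this gives three tables per command. Some cells will contain very few respondents, so treat their means with caution.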

By default Stata only shows percentages for tables with one variable. If you want percentages for other tables, you need to tell it which percentages you want by adding the appropriate option.

Consider the table of sex and class we created earlier:

tab class sex

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24
WORKING CLASS |        57         64 |       121
 MIDDLE CLASS |        37         60 |        97
  UPPER CLASS |         5          5 |        10
--------------+----------------------+----------
        Total |       109        143 |       252

If we add the row option, Stata will tell us what percentage of each class is male and what percentage is female in our sample:

tab class sex, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24
              |     41.67      58.33 |    100.00
--------------+----------------------+----------
WORKING CLASS |        57         64 |       121
              |     47.11      52.89 |    100.00
--------------+----------------------+----------
 MIDDLE CLASS |        37         60 |        97
              |     38.14      61.86 |    100.00
--------------+----------------------+----------
  UPPER CLASS |         5          5 |        10
              |     50.00      50.00 |    100.00
--------------+----------------------+----------
        Total |       109        143 |       252
              |     43.25      56.75 |    100.00

If we add the column (or col) option, Stata will tell us what percentage of the males are in each class and what percentage of the females are in each class:

tab class sex, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24
              |      9.17       9.79 |      9.52
--------------+----------------------+----------
WORKING CLASS |        57         64 |       121
              |     52.29      44.76 |     48.02
--------------+----------------------+----------
 MIDDLE CLASS |        37         60 |        97
              |     33.94      41.96 |     38.49
--------------+----------------------+----------
  UPPER CLASS |         5          5 |        10
              |      4.59       3.50 |      3.97
--------------+----------------------+----------
        Total |       109        143 |       252
              |    100.00     100.00 |    100.00

If we add the cell option, Stata will tell us what percentage each combination of class and sex is of the total sample.

tab class sex, cell

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24
              |      3.97       5.56 |      9.52
--------------+----------------------+----------
WORKING CLASS |        57         64 |       121
              |     22.62      25.40 |     48.02
--------------+----------------------+----------
 MIDDLE CLASS |        37         60 |        97
              |     14.68      23.81 |     38.49
--------------+----------------------+----------
  UPPER CLASS |         5          5 |        10
              |      1.98       1.98 |      3.97
--------------+----------------------+----------
        Total |       109        143 |       252
              |     43.25      56.75 |    100.00

Which one you want depends on what question you're asking.
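These options can also be combined, and the frequencies themselves can be left out. The following is a small sketch, assuming gss_sample is already loaded as in descriptives.do; nofreq is an additional tab option not discussed above:

```stata
* Assumes gss_sample is already loaded, as in descriptives.do
* Row and column percentages in the same table
tab class sex, row col

* Row percentages only, without the frequencies
tab class sex, row nofreq
```

Putting several percentages in one table makes each cell harder to read, so it is usually clearer to request only the percentage that answers your question.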

By default tab shows value labels for any variable that has them. If you need to see the actual values, add the nolabel option:

tab sex, nolabel

RESPONDENTS |
        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        110       43.31       43.31
          2 |        144       56.69      100.00
------------+-----------------------------------
      Total |        254      100.00
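One reason to look at the actual values is that conditions in Stata commands refer to values, not value labels. A minimal sketch, assuming gss_sample is already loaded and that sex is coded 1 for male and 2 for female, as the nolabel table above shows:

```stata
* Assumes gss_sample is already loaded, as in descriptives.do
* Frequencies of class for female respondents only (sex == 2)
tab class if sex == 2
```

This should reproduce the female column of the class by sex table shown earlier: 14, 64, 60, and 5 respondents, 143 in total.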

By default tab does not include missing values in its tables, which makes it easy to forget about them. Add the missing (or miss) option to see them:

tab class, miss

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |
            N |      Freq.     Percent        Cum.
--------------+-----------------------------------
  LOWER CLASS |         24        9.45        9.45
WORKING CLASS |        121       47.64       57.09
 MIDDLE CLASS |         97       38.19       95.28
  UPPER CLASS |         10        3.94       99.21
           .b |          1        0.39       99.61
           na |          1        0.39      100.00
--------------+-----------------------------------
        Total |        254      100.00

"na", i.e. "Not Applicable", is the value label applied to .c (as you can see if you add the nolabel option). The value .b also means missing, but has no label to say why it's missing. Whenever you use the class variable you should keep in mind that it is missing for two respondents.
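A quick way to check for missing values before running any tables is sketched below. It assumes gss_sample is already loaded as in descriptives.do, and it uses the count command and the missing() function, neither of which appears elsewhere in this article:

```stata
* Assumes gss_sample is already loaded, as in descriptives.do
* Number of observations where class is missing, whatever the missing code
count if missing(class)

* Show the underlying missing codes (.b and .c) rather than their labels
tab class, miss nolabel

* Keep missing values visible in a two-way table as well
tab class sex, miss
```

Given the table above, count should report 2.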

The following is a complete do file for this section:

capture log close

log using descriptives.log, replace

clear all

set more off

use gss_sample

tab sex

sum educ

sum educ, detail

tab class sex

tab class, sum(educ)

bysort sex: tab class race

bysort sex divorce: tab class race

tab class sex, sum(educ)

tab class sex

tab class sex, row

tab class sex, col

tab class sex, cell

tab sex, nolabel

tab class, miss

log close

Last Revised: 7/13/2016