Stata for Students: Descriptive Statistics

This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.

Descriptive statistics give you a basic understanding of one or more variables and how they relate to each other.

Topics Covered in this Section

Frequencies for a Single Categorical Variable

For a variable that describes categories (like sex or race) rather than quantities (like income), frequencies tell you how many observations are in each category. These are examples of univariate statistics, or statistics that describe a single variable.

Categorical variables are also sometimes called factor variables. Indicator variables (also called binary or dummy variables) are just categorical variables with two categories. Frequency tables for a single variable are sometimes called one-way tables.

Summary Statistics for a Single Quantitative Variable

For a variable that describes quantities (like income), the mean tells you what the expected value of the variable is, and the standard deviation tells you how much it varies. However, the median and percentiles often give you a better sense of how the variable is distributed, especially for variables that are not symmetric (like income, which often has a few very high values). These are also univariate statistics.

Quantitative variables are often called continuous variables. Means are often called averages, and variance is just the standard deviation squared. The median is also the 50th percentile.

Frequencies for Two Categorical Variables

For two categorical variables, frequencies tell you how many observations fall in each combination of the two categorical variables (like black women or Hispanic men) and can give you a sense of the relationship between the two variables. These are examples of bivariate statistics, or statistics that describe the joint distribution of the two variables.

Tables of frequencies for two variables are often called two-way tables, contingency tables, or crosstabs.

Summary Statistics for One Quantitative Variable over One Categorical Variable

For a quantitative variable and a categorical variable, the mean value of the quantitative variable for those observations that fall in each category of the categorical variable can give you a sense of how the two variables are related. Often the question of interest is whether the distribution of the quantitative variable is different for different categories. These are also examples of bivariate statistics.

Frequencies for Three or More Categorical Variables

For three or more categorical variables, frequencies will tell you how many observations fall in each combination of the variables and give you a sense of their relationships just like they did with two categorical variables. These are examples of multivariate statistics.

Summary Statistics for One Quantitative Variable over Two or More Categorical Variables

For a quantitative variable and two or more categorical variables, the mean value of the quantitative variable for those observations in each combination of the categorical variables can give you a sense of how the variables are related, just like it did with a quantitative variable and one categorical variable. These are examples of multivariate statistics.

Commands Used

All of these tasks can be carried out using just two Stata commands: tabulate (or tab) and summarize (or sum). Getting them to do all these things is simply a matter of applying Stata syntax, so if you've read How Stata Commands Work, this section will have no surprises for you.

Some commonly used options can change what the tables produced by tab look like, as described in the sections below:

Adding percentages
Viewing values instead of labels
Viewing missing values

Setting Up

If you plan to carry out the examples in this article, make sure you've downloaded the GSS sample to your U:\SFS folder as described in Managing Stata Files. Then create a do file called descriptives.do in that folder as described in Doing Your Work Using Do Files and start with the following code:

capture log close
log using descriptives.log, replace

clear all
set more off

use gss_sample

// do work here

log close

If you plan on applying what you learn directly to your homework, create a similar do file but have it load the data set used for your assignment.

Frequencies for a Single Categorical Variable

The tabulate command, or just tab, creates tables of frequencies. To have it give you frequencies for a single categorical variable simply tell it which variable you want it to act on:

tab sex

This produces the following output:

RESPONDENTS |
        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
       male |        110       43.31       43.31
     female |        144       56.69      100.00
------------+-----------------------------------
      Total |        254      100.00

This tells us that in the GSS sample, 110 of the respondents are male (43.31%) and 144 of the respondents are female (56.69%), for a total of 254 respondents.
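
As a minimal sketch (assuming gss_sample is in the current folder, as in Setting Up), the tab1 command produces a separate one-way table for each variable you list, and display can confirm a percentage by hand. tab1 is not used elsewhere in this article; it is shown here only as a convenience.

```stata
use gss_sample, clear

// One-way frequency table for each listed variable
tab1 sex race class

// Check the male percentage by hand: 110 of the 254 respondents
display %5.2f 100*110/254
```

The display line should print 43.31, matching the Percent column in the table above.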

Summary Statistics for a Single Quantitative Variable

The summarize command, or just sum, creates tables of summary statistics. To have it give you summary statistics for a single variable, simply tell it which variable you want it to act on:

sum educ

This produces the following output:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
        educ |        254    13.38583    3.336343          0         20

This tells us that the mean years of education of the respondents in our GSS sample is 13.4 years.

If you want to see percentiles, add the detail option:

sum educ, detail

This produces the following output:

              HIGHEST YEAR OF SCHOOL COMPLETED
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5              0
 5%            7              2
10%           10              5       Obs                 254
25%           12              6       Sum of Wgt.         254

50%           13                      Mean           13.38583
                        Largest       Std. Dev.      3.336343
75%           16             20
90%           18             20       Variance       11.13118
95%           19             20       Skewness      -.3988663
99%           20             20       Kurtosis       3.899598

The median, 13, is in the table as the 50th percentile. Note how the 25th percentile is 12, just one year lower than the median, while the 75th percentile is three years higher at 16. This shows that the distribution of educ is asymmetric. A histogram (hist educ or hist educ, discrete) would be a good tool to understand its distribution.
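
The comparison of the quartiles to the median can also be done in code. The following is a sketch, assuming gss_sample is in the current folder: summarize with the detail option stores its results in r(), so the percentiles can be used in later commands.

```stata
use gss_sample, clear

// Run summarize quietly; the results are still stored in r()
quietly summarize educ, detail

display "25th percentile: " r(p25)
display "Median:          " r(p50)
display "75th percentile: " r(p75)

// Distance from the median to each quartile
display "Median minus 25th percentile: " r(p50)-r(p25)
display "75th percentile minus median: " r(p75)-r(p50)

// Histogram with one bar per year of education
histogram educ, discrete
```

Given the output above, the two distances should be 1 and 3, which is the asymmetry described in the text. Typing return list after summarize shows everything that was stored.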

Frequencies for Two Categorical Variables

To have tab give you the joint distribution of two categorical variables, tell it which two variables you want it to act on:

tab class sex

This produces the following output:

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24 
WORKING CLASS |        57         64 |       121 
 MIDDLE CLASS |        37         60 |        97 
  UPPER CLASS |         5          5 |        10 
--------------+----------------------+----------
        Total |       109        143 |       252

This tells us that in the GSS sample 57 of the respondents are males who consider themselves working class while 60 are females who consider themselves middle class. If you want percentages, see Adding Percentages to tab Output.

The first variable you list will be placed in rows and the second in columns. The table will usually be easier to read if the variable with the most unique values is listed first. On the other hand, if you're thinking of the two variables as a dependent variable and an independent variable, the dependent variable is usually listed first so it goes in the rows.

Summary Statistics for One Quantitative Variable over One Categorical Variable

If you start with a tab command and then add the sum() option, with the name of a continuous variable in the parentheses, Stata will add summary statistics for that variable to each cell of the table:

tab class, sum(educ)

Gives:

 SUBJECTIVE |
      CLASS |  Summary of HIGHEST YEAR OF SCHOOL
IDENTIFICAT |              COMPLETED
        ION |        Mean   Std. Dev.       Freq.
------------+------------------------------------
  LOWER CLA |        11.5   3.5630959          24
  WORKING C |   12.570248   3.1247038         121
  MIDDLE CL |    14.71134   3.0171688          97
  UPPER CLA |        15.2   3.4253954          10
------------+------------------------------------
      Total |   13.396825   3.3473052         252

This tells us that the 24 respondents who consider themselves lower class have an average of 11.5 years of education, while the 10 respondents who consider themselves upper class have an average of 15.2 years of education. Examination of the table as a whole suggests a relationship between formal education and class, which should surprise no one.
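
If you also want medians by category, two alternatives may help. This is a sketch, assuming gss_sample is in the current folder; tabstat is not one of the two commands this article focuses on and is shown only as one possible option.

```stata
use gss_sample, clear

// Full summarize output for educ within each value of class
bysort class: sum educ

// Compact table of the mean, median, standard deviation, and count by class
tabstat educ, by(class) statistics(mean median sd n)
```

Note that by creates a group for every value of class, including missing values, so the first command may print more groups than the table above.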

Frequencies for Three or More Categorical Variables

You cannot give tab a list of three variables to act on. However, you can use by to create separate tables for each value of a categorical variable:

bysort sex: tab class race

This produces:

-----------------------------------------------------------------
-> sex = male

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |        RACE OF RESPONDENT
            N |     white      black      other |     Total
--------------+---------------------------------+----------
  LOWER CLASS |         6          1          3 |        10 
WORKING CLASS |        41          8          8 |        57 
 MIDDLE CLASS |        30          3          4 |        37 
  UPPER CLASS |         3          2          0 |         5 
--------------+---------------------------------+----------
        Total |        80         14         15 |       109 


------------------------------------------------------------------
-> sex = female

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |        RACE OF RESPONDENT
            N |     white      black      other |     Total
--------------+---------------------------------+----------
  LOWER CLASS |         7          5          2 |        14 
WORKING CLASS |        49         12          3 |        64 
 MIDDLE CLASS |        45         10          5 |        60 
  UPPER CLASS |         3          1          1 |         5 
--------------+---------------------------------+----------
        Total |       104         28         11 |       143

This tells us that the sample includes 6 respondents who are male, white, and identify as lower class, as well as 10 respondents who are female, black, and identify as middle class.

You can add as many variables as you need to the by part of the command:

bysort sex divorce: tab class race

However, the amount of output you'll get can become cumbersome quickly.
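
If you only need the table for one group, an if condition keeps the output short. A minimal sketch, assuming gss_sample is in the current folder and that sex is coded 1 for male and 2 for female (as shown in Viewing Values Instead of Labels):

```stata
use gss_sample, clear

// Two-way table of class and race for males only
tab class race if sex == 1

// The same table for females only
tab class race if sex == 2
```

Each command should reproduce one of the two tables that bysort sex: tab class race printed above.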

Summary Statistics for One Quantitative Variable over Two or More Categorical Variables

The sum() option works the same way in a table of two categorical variables as it does with one categorical variable: each cell shows the mean, standard deviation, and frequency for the observations in that cell.

tab class sex, sum(educ)

Produces:

                Means, Standard Deviations and Frequencies
                    of HIGHEST YEAR OF SCHOOL COMPLETED

SUBJECTIVE |
     CLASS |
IDENTIFICA |   RESPONDENTS SEX
      TION |      male     female |     Total
-----------+----------------------+----------
 LOWER CLA |      11.7  11.357143 |      11.5
           | 4.5227818   2.871803 | 3.5630959
           |        10         14 |        24
-----------+----------------------+----------
 WORKING C | 12.719298    12.4375 | 12.570248
           | 3.1495713  3.1212279 | 3.1247038
           |        57         64 |       121
-----------+----------------------+----------
 MIDDLE CL | 14.513514  14.833333 |  14.71134
           | 3.2966449  2.8532863 | 3.0171688
           |        37         60 |        97
-----------+----------------------+----------
 UPPER CLA |      15.2       15.2 |      15.2
           | 3.0331502  4.1472883 | 3.4253954
           |         5          5 |        10
-----------+----------------------+----------
     Total | 13.348624  13.433566 | 13.396825
           | 3.4490368  3.2793838 | 3.3473052
           |       109        143 |       252

This tells us that the 37 males in our sample who identify as middle class have an average of 14.5 years of education, while the 60 females who identify as middle class have an average of 14.8 years of education.

If you need to consider three or more categorical variables, use by as described above.
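
For example, the following sketch (assuming gss_sample is in the current folder) adds race as a third categorical variable. The nostandard and nofreq options drop the standard deviations and frequencies so each cell shows only the mean, which keeps the repeated tables readable.

```stata
use gss_sample, clear

// Mean education by class and sex, in a separate table for each race
bysort race: tab class sex, sum(educ)

// The same tables showing only the means
bysort race: tab class sex, sum(educ) nostandard nofreq
```

With this many groups some cells will contain very few respondents, so treat the means in small cells with caution.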

Adding Percentages to tab Output

By default Stata only shows percentages for tables with one variable. If you want percentages for other tables, you need to tell it which percentages you want by adding the appropriate option.

Consider the table of sex and class we created earlier:

tab class sex

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24 
WORKING CLASS |        57         64 |       121 
 MIDDLE CLASS |        37         60 |        97 
  UPPER CLASS |         5          5 |        10 
--------------+----------------------+----------
        Total |       109        143 |       252

If we add the row option, Stata will tell us what percentage of each class is male and what percentage is female in our sample:

tab class sex, row

+----------------+
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24 
              |     41.67      58.33 |    100.00 
--------------+----------------------+----------
WORKING CLASS |        57         64 |       121 
              |     47.11      52.89 |    100.00 
--------------+----------------------+----------
 MIDDLE CLASS |        37         60 |        97 
              |     38.14      61.86 |    100.00 
--------------+----------------------+----------
  UPPER CLASS |         5          5 |        10 
              |     50.00      50.00 |    100.00 
--------------+----------------------+----------
        Total |       109        143 |       252 
              |     43.25      56.75 |    100.00

If we add the column (or col) option, Stata will tell us what percentage of the males are in each class and what percentage of the females are in each class:

tab class sex, col

+-------------------+
| Key               |
|-------------------|
|     frequency     |
| column percentage |
+-------------------+

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24 
              |      9.17       9.79 |      9.52 
--------------+----------------------+----------
WORKING CLASS |        57         64 |       121 
              |     52.29      44.76 |     48.02 
--------------+----------------------+----------
 MIDDLE CLASS |        37         60 |        97 
              |     33.94      41.96 |     38.49 
--------------+----------------------+----------
  UPPER CLASS |         5          5 |        10 
              |      4.59       3.50 |      3.97 
--------------+----------------------+----------
        Total |       109        143 |       252 
              |    100.00     100.00 |    100.00

If we add the cell option, Stata will tell us what percentage each combination of class and sex is of the total sample.

tab class sex, cell

+-----------------+
| Key             |
|-----------------|
|    frequency    |
| cell percentage |
+-----------------+

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |    RESPONDENTS SEX
            N |      male     female |     Total
--------------+----------------------+----------
  LOWER CLASS |        10         14 |        24 
              |      3.97       5.56 |      9.52 
--------------+----------------------+----------
WORKING CLASS |        57         64 |       121 
              |     22.62      25.40 |     48.02 
--------------+----------------------+----------
 MIDDLE CLASS |        37         60 |        97 
              |     14.68      23.81 |     38.49 
--------------+----------------------+----------
  UPPER CLASS |         5          5 |        10 
              |      1.98       1.98 |      3.97 
--------------+----------------------+----------
        Total |       109        143 |       252 
              |     43.25      56.75 |    100.00

Which one you want depends on what question you're asking.
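
To see where the three percentages come from, take the 10 lower class males from the tables above. The sketch below divides that count by the row total (24), the column total (109), and the table total (252). It then shows that the options can be combined, and that nofreq hides the raw counts. It assumes gss_sample is in the current folder.

```stata
// Row percentage: share of the lower class that is male
display %5.2f 100*10/24
// Column percentage: share of males who are lower class
display %5.2f 100*10/109
// Cell percentage: share of the whole table
display %5.2f 100*10/252

use gss_sample, clear

// Row and column percentages in the same table
tab class sex, row col

// Column percentages only, without the frequencies
tab class sex, col nofreq
```

The three display commands should print 41.67, 9.17, and 3.97, matching the first cell of the row, column, and cell tables respectively.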

Viewing Values Instead of Labels

By default tab shows value labels for any variable that has them. If you need to see the actual values, add the nolabel option:

tab sex, nolabel

RESPONDENTS |
        SEX |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |        110       43.31       43.31
          2 |        144       56.69      100.00
------------+-----------------------------------
      Total |        254      100.00
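
If you want to see values and labels together rather than one or the other, the following sketch may help. It assumes gss_sample is in the current folder.

```stata
use gss_sample, clear

// Show the name of the value label attached to sex
describe sex

// Show each value of sex along with its label and frequency
codebook sex

// List every value label defined in the data set
label list
```

Knowing the underlying values matters when you write conditions, since a condition such as if sex == 1 must use the value rather than the label.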

Viewing Missing Values

By default tab does not include missing values in its tables, which makes it easy to forget about them. Add the missing (or miss) option to see them:

tab class, miss

   SUBJECTIVE |
        CLASS |
IDENTIFICATIO |
            N |      Freq.     Percent        Cum.
--------------+-----------------------------------
  LOWER CLASS |         24        9.45        9.45
WORKING CLASS |        121       47.64       57.09
 MIDDLE CLASS |         97       38.19       95.28
  UPPER CLASS |         10        3.94       99.21
           .b |          1        0.39       99.61
           na |          1        0.39      100.00
--------------+-----------------------------------
        Total |        254      100.00

"na", i.e. "Not Applicable", is the value label applied to .c (as you can see if you add the nolabel option). The value .b also means missing, but has no label to say why it's missing. Whenever you use the class variable you should keep in mind that it is missing for two respondents.

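A quick way to check for missing values without reading a whole table is to count them. A minimal sketch, assuming gss_sample is in the current folder:

```stata
use gss_sample, clear

// Count observations where class is any kind of missing value (., .a, .b, ...)
count if missing(class)

// Show the underlying missing value codes instead of their labels
tab class, miss nolabel
```

Given the table above, count should report 2, and the second table should show .b and .c in place of the "na" label.
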
Complete Do File

The following is a complete do file for this section:

capture log close
log using descriptives.log, replace

clear all
set more off

use gss_sample

tab sex

sum educ
sum educ, detail

tab class sex

tab class, sum(educ)

bysort sex: tab class race
bysort sex divorce: tab class race

tab class sex, sum(educ)

tab class sex
tab class sex, row
tab class sex, col
tab class sex, cell

tab sex, nolabel

tab class, miss

log close

Last Revised: 7/13/2016