Stata for Students: Correlations

This article is part of the Stata for Students series. If you are new to Stata we strongly recommend reading all the articles in the Stata Basics section.

A correlation measures how strongly two quantitative variables are related. It can only perfectly measure linear relationships, but a linear relationship often serves as a reasonable first approximation to other kinds of relationships. You can calculate correlations for categorical variables, and the results will sometimes point you in the right direction, but there are better ways to describe relationships involving categorical variables.

Correlation coefficients range from -1 to 1. A positive correlation coefficient means the two variables tend to move together: an observation which has a high value for one variable is likely to have a high value for the other, and vice versa. A negative correlation coefficient means they tend to move in opposite directions: observations with a high value for one variable are likely to have a low value for the other. The farther the coefficient is from zero, in either direction, the stronger the relationship. Variables which are independent will have a correlation of zero, but variables which are related but not in a linear way can also have a correlation of zero.
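
To see how a strong but nonlinear relationship can produce a correlation of zero, here is a small sketch using made-up data rather than the GSS sample. The variable names x and y are just for illustration. Because y is exactly x squared and x is symmetric around zero, the correlation should come out as zero (up to rounding) even though y is completely determined by x.

* Made-up data for illustration; run this separately, since clear discards the data in memory
clear
set obs 21
* x runs from -10 to 10
gen x = _n - 11
* y depends entirely on x, but not in a linear way
gen y = x^2
cor x y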

Setting Up

If you plan to carry out the examples in this article, make sure you've downloaded the GSS sample to your U:\SFS folder as described in Managing Stata Files. Then create a do file in that folder that loads the GSS sample as described in Doing Your Work Using Do Files. If you plan on applying what you learn directly to your homework, create a similar do file but have it load the data set used for your assignment.

Calculating Correlations

The correlate command, often abbreviated cor, calculates correlations. List the variables you want correlations for after the command.

cor sei10 educ height weight

This gives you the correlations between the respondent's socioeconomic status, years of education, height, and weight. They are given in the form of a matrix, but only half of the matrix is shown because it is symmetric:


             |    sei10     educ   height   weight
       sei10 |   1.0000
        educ |   0.6205   1.0000
      height |   0.2466   0.1868   1.0000
      weight |   0.1048  -0.0224   0.5282   1.0000

This shows that the correlation between socioeconomic status and education is .6205, which is fairly high. The correlation between socioeconomic status and height, .2466, is weaker, but it's interesting that it's positive at all. Keep in mind that correlation does not imply causation. We cannot tell from these results whether high socioeconomic status causes people to grow taller, whether being tall causes people to have higher socioeconomic status (both can be true, and there's evidence for both theories), or whether something else causes people to both grow taller and have higher socioeconomic status.
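
Note that correlate only uses observations that have non-missing values for all of the variables listed. If you would rather have each correlation calculated from every observation available for that pair of variables, the pwcorr command is one alternative. In this sketch the obs option asks for the number of observations used for each pair and the sig option asks for the significance level (p-value) of each correlation. The output is not shown here, and its numbers may differ slightly from the table above because the observations used can differ.

* Pairwise correlations, with the number of observations and p-value for each pair
pwcorr sei10 educ height weight, obs sig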

The correlation between weight and education, -.0224, is essentially zero. The negative sign indicates that people with more education tend to weigh slightly less, but the relationship is very weak. On the other hand, education and height are positively correlated and height and weight are strongly positively correlated, which raises the possibility that education and weight might have a stronger negative relationship if we could control for height. Multivariate regression allows us to explore that possibility.
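
As a rough sketch of what that might look like, the regress command with weight as the outcome and both educ and height as predictors estimates the relationship between education and weight while holding height constant. The output is not shown here, and interpreting regression results is a topic of its own.

* weight is the outcome; the coefficient on educ is estimated controlling for height
regress weight educ height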

Calculating Covariances

If you want covariances instead, add the cov option:

cor sei10 educ height weight, cov


             |    sei10     educ   height   weight
       sei10 |  510.103
        educ |  43.4237  9.59983
      height |  22.7511  2.36376  16.6884
      weight |  99.2858 -2.91236  90.4648  1757.94

Unlike correlations, covariances are not restricted to the range of -1 to 1. They depend on both how much the variables vary together and how much each variable varies overall, so their size depends on the units the variables are measured in. But the interpretations of positive and negative numbers are similar. The diagonal of the matrix gives you the variance of each variable, or its standard deviation squared.
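
The two matrices are closely linked: a correlation is the covariance divided by the square root of the product of the two variances. As a quick check, the sketch below uses display as a calculator with the numbers from the covariance matrix above, and the result should be roughly .6205, matching the correlation between sei10 and educ. It then uses summarize to confirm that the variance of educ is its standard deviation squared. The if condition is an assumption about how to match the sample: it restricts summarize to observations with no missing values in any of the four variables, which are the observations cor uses.

* Correlation of sei10 and educ computed from their covariance and variances
display 43.4237 / sqrt(510.103 * 9.59983)

* Variance of educ, using only the observations with all four variables present
summarize educ if !missing(sei10, educ, height, weight)
display r(Var)
display r(sd)^2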

Complete Do File

capture log close
log using cor.log, replace

clear all
set more off

use gss_sample

cor sei10 educ height weight
cor sei10 educ height weight, cov

log close

Last Revised: 11/17/2016