Stata and the Grammar of Graphics

Doug Hemken

January 2017

Introduction

Stata, like other general purpose statistical software, includes commands for creating graphics based on data.

A conceptual framework that attempts to describe all data-based graphs is The Grammar of Graphics (Second Edition, 2005) by Leland Wilkinson (with several contributors). This gives us a roadmap for navigating statistical graphics in general, and Stata graphics in particular.

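In specifying any graph, we must describe four things: the data and variables, the graphical objects, the coordinates and guides, and the annotation and aesthetics.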

These four conceptual areas are independent of each other, and in Wilkinson's formulation are further refined into even more independent dimensions.

Once we have specified the data and the graphical objects, everything else will have default values.

Data and Variables

The most common and fundamental graph commands use one or more variables from the data set in memory. However, there are other possibilities. For example, a number of postestimation graphs rely on the results that Stata stores after any estimation command, which can include scalars, matrices, and macros. Still other graph commands take only scalar values as input ("scalar" in the mathematical sense).
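As a minimal sketch of these three kinds of input, assuming the auto data set that ships with Stata: the first graph uses variables, the second is a postestimation graph that draws on stored estimation results, and the third takes only literal numbers. The particular commands (rvfplot and twoway scatteri) are chosen here for illustration and are not the only ones of their kind.

```stata
sysuse auto, clear                  // example data shipped with Stata
twoway scatter mpg weight           // input: variables in the data set

regress mpg weight                  // estimation stores its results in e()
rvfplot                             // residual-versus-fitted plot, built from those results

twoway scatteri 20 2000 30 3000     // input: literal y x pairs, no variables needed
```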

As in statistical estimation, you will often have to get your data into shape before you can use it in a graph command, but you will also find that some graph commands do some data manipulation for you.
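For instance, a bar chart of group means can be produced either way. The sketch below assumes the auto data set and uses graph bar as the illustration: in the first call the graph command computes the means itself, and in the second we collapse the data first and ask for the values to be plotted as they are.

```stata
sysuse auto, clear
graph bar (mean) mpg, over(foreign)     // graph bar computes the group means for us

preserve                                // keep the original data recoverable
collapse (mean) mpg, by(foreign)        // compute the group means ourselves
graph bar (asis) mpg, over(foreign)     // plot the stored values as they are
restore                                 // bring back the full data set
```

Both calls should draw the same two bars; the difference is only in where the summarizing happens.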

Graphical Objects

Conceptually, the basic elements for graphing in a two-dimensional space are points, lines, and bounded areas. In practice, most software (Stata included) lets us treat such things as bars and box-and-whisker symbols as distinct graphical objects. Stata also has a variety of different line segment objects.

In Stata, the various graphing commands are specified according to the graphical object they produce: scatter produces points, line produces lines, area produces bounded areas, bar produces bars, etc.
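To illustrate, the same two variables can be rendered with different graphical objects simply by changing the command name. This is a sketch assuming the auto data set; the commands are written in their full twoway form, and the sort option is added so that lines and areas are drawn in order of the x variable.

```stata
sysuse auto, clear
twoway scatter mpg weight          // points
twoway line mpg weight, sort       // a connected line
twoway area mpg weight, sort       // a bounded (filled) area
twoway bar mpg weight              // bars, one at each x value
```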

The minimum specification for most graph commands is the name of a graphing object, and the names of one or more variables in the data set. Based on these two specifications, everything else necessary to render a graph has a default value.

sysuse auto // load a data set
scatter mpg weight  // specify graphing object and variables
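One way to see how much the defaults are doing is to override a few of them. The sketch below redraws the same scatterplot with a title, axis titles, and a marker symbol specified explicitly; the label text is made up for illustration.

```stata
sysuse auto, clear
scatter mpg weight, title("Mileage by weight") ///
    xtitle("Weight (lbs.)") ytitle("Mileage (mpg)") ///
    msymbol(Oh)                    // hollow circles in place of the default marker
```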

Coordinates and Guides

With the exception of pie charts, Stata largely draws graphs using some version of Cartesian coordinates.
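A brief sketch of what "some version of Cartesian coordinates" can mean in practice, assuming the auto data set: the axes can be rescaled or reversed and remain Cartesian, while a pie chart has no x or y axis at all.

```stata
sysuse auto, clear
twoway scatter mpg weight                                // plain Cartesian axes
twoway scatter mpg weight, xscale(log) yscale(reverse)   // log x axis, reversed y axis
graph pie, over(foreign)                                 // the exception: slices, no axes
```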

Annotation and Aesthetics