# 5 Two Variables

*If you are starting from this page, please run the code at Libraries and Data Setup before proceeding.*

## 5.1 One Discrete, One Continuous

### 5.1.1 Barplots

Barplots can also be used when plotting two variables. To do so, supply `geom_bar()`

with the `stat = "identity"`

argument to override the default of `stat = "count"`

.

The pipe below calculates the mean income by education level.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(edu) %>%
summarize(income = mean(income, na.rm = T)) %>%
ggplot(aes(x = edu, y = income)) +
geom_bar(stat = "identity") +
coord_flip()
```

We may want to add text with the exact (but rounded) value represented by each bar.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(edu) %>%
summarize(income = mean(income, na.rm = T)) %>%
ggplot(aes(x = edu, y = income)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_text(aes(label = round(income)))
```

The text has been unhelpfully centered at the end of each bar, making it difficult to read.

Adding text and labels in ggplot involves a lot of trial and error. Text positions can be adjusted with horizontally with the `hjust`

argument, and vertically with `vjust`

.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(edu) %>%
summarize(income = mean(income, na.rm = T)) %>%
ggplot(aes(x = edu, y = income)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_text(aes(label = round(income)),
hjust = -.2)
```

The text for the first bar is not fully visible. To fix this, adjust the axis limits.

Remember, the coordinates were flipped, so the horizontal axis *is actually the y-axis* and is mapped to the y aesthetic of `income`

(`aes(..., y = income)`

).

The current “y” limit appears to be around 70,000. Perhaps increasing it to 75,000 will be enough to see the text. We can adjust limits by supplying `ylim()`

with a two-number vector of the minimum and maximum.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(edu) %>%
summarize(income = mean(income, na.rm = T)) %>%
ggplot(aes(x = edu, y = income)) +
geom_bar(stat = "identity") +
coord_flip() +
geom_text(aes(label = round(income)),
hjust = -.2) +
ylim(c(0, 75000))
```

Perfect.

### 5.1.2 Boxplots

Boxplots are another option for visualizing a continuous variable along a discrete variable. Again, `coord_flip()`

can be used to rotate the plot 90 degrees.

```
acs %>%
filter(!is.na(edu), !is.na(income)) %>%
ggplot(aes(x = edu, y = income)) +
geom_boxplot() +
coord_flip()
```

The plot above defaulted to scientific notation for the y-axis labels. (Remember the axes were flipped, so the horizontal axis is actually the y-axis!). To fix the y-axis labels, set the `labels`

argument of `scale_y_continuous()`

to `comma`

, an option available from the `scales`

package.

```
acs %>%
filter(!is.na(edu), !is.na(income)) %>%
ggplot(aes(x = edu, y = income)) +
geom_boxplot() +
coord_flip() +
scale_y_continuous(labels = comma)
```

We might also like to use a combination of geoms to visualize our data. One nice combination is the violin plot and the boxplot.

```
ggplot(acs, aes(x = race, y = age)) +
geom_violin() +
geom_boxplot() +
coord_flip()
```

The boxplot has covered up the violin plot, so we can reduce the width of the boxplot with the `width`

argument.

```
ggplot(acs, aes(x = race, y = age)) +
geom_violin() +
geom_boxplot(width = .25) +
coord_flip()
```

Remember, order matters. Geoms are added one on top of another, so if we plot the boxplot first and the violin plot second…

```
ggplot(acs, aes(x = race, y = age)) +
geom_boxplot(width = .25) +
geom_violin() +
coord_flip()
```

## 5.2 Two Continuous

### 5.2.1 Scatterplots

The relationship of two continuous variables can be visualized with a scatterplot, accomplished with `geom_point()`

.

```
ggplot(acs, aes(x = age, y = income)) +
geom_point()
```

`## Warning: Removed 6173 rows containing missing values (geom_point).`

When you have many points, and here we have over 20,000, scatterplots can become difficult to read. The bottom eighth of the plot is a mass of points, and we cannot be sure how dense that area is.

One option to rectify the plot is to reduce the size of the points. `size`

is measured in millimeters.

```
ggplot(acs, aes(x = age, y = income)) +
geom_point(size = .5)
```

`## Warning: Removed 6173 rows containing missing values (geom_point).`

This helps us see that `age`

is coded as an integer, but it does little to help us see the density of points.

Instead, we can use the `alpha`

argument. This value ranges from 0 to 1, where 0 is transparent and 1 is opaque (the default). We have a lot of points, so we can set the value fairly low.

```
ggplot(acs, aes(x = age, y = income)) +
geom_point(alpha = .1)
```

`## Warning: Removed 6173 rows containing missing values (geom_point).`

Now, the darkness of a point cluster represents its density. Comparing this to the first plot, we see that the upper part of the big mass of points actually represents fewer people than the lower part.

We can also add a linear fit of `y ~ x`

by using `geom_smooth()`

and setting its `method`

to `"lm"`

.

```
ggplot(acs, aes(x = age, y = income)) +
geom_point(alpha = .1) +
geom_smooth(method = "lm")
```

`## `geom_smooth()` using formula 'y ~ x'`

`## Warning: Removed 6173 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 6173 rows containing missing values (geom_point).`

Deleting the method argument will produce a plot with a smoothed conditional mean. A standard error ribbon is included by default (`se = T`

), but the standard error is much too small to see clearly in this plot.

```
ggplot(acs, aes(x = age, y = income)) +
geom_point(alpha = .1) +
geom_smooth()
```

`## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'`

`## Warning: Removed 6173 rows containing non-finite values (stat_smooth).`

`## Warning: Removed 6173 rows containing missing values (geom_point).`

## 5.3 Two Discrete

### 5.3.1 Scatterplots

The bivariate distribution of two discrete variables can be examined with a table:

`table(acs$edu, acs$race, useNA = "ifany")`

```
##
## Asian Black Hispanic Other White
## Less than High School 340 1459 2171 417 5924
## High School 112 645 536 143 4327
## Some College 170 584 455 145 4374
## Bachelors 165 228 113 68 2346
## Advanced Degree 107 94 51 17 1296
## <NA> 42 135 221 63 662
```

`prop.table(table(acs$edu, acs$race, useNA = "ifany"))`

```
##
## Asian Black Hispanic Other
## Less than High School 0.0124042320 0.0532287486 0.0792046698 0.0152134258
## High School 0.0040861000 0.0235315578 0.0195549070 0.0052170741
## Some College 0.0062021160 0.0213060927 0.0165997811 0.0052900401
## Bachelors 0.0060197008 0.0083181321 0.0041225830 0.0024808464
## Advanced Degree 0.0039036848 0.0034294053 0.0018606348 0.0006202116
## <NA> 0.0015322875 0.0049252098 0.0080627508 0.0022984312
##
## White
## Less than High School 0.2161255016
## High School 0.1578620941
## Some College 0.1595767968
## Bachelors 0.0855892010
## Advanced Degree 0.0472820139
## <NA> 0.0241517694
```

One option for plotting this relationship is a scatterplot, which at first seems like a terrible idea:

```
ggplot(acs, aes(x = race, y = edu)) +
geom_point()
```

All we learn from that plot is that all combinations of `race`

and `edu`

were observed in the data.

We can add a small amount of noise (jitter) to the x and y variables by changing `geom_point()`

to `geom_jitter()`

.

```
ggplot(acs, aes(x = race, y = edu)) +
geom_jitter()
```

We can begin to see differences between the variables, but the “White” column is so densely populated that we cannot tell the difference in counts between the levels of education.

A little trial-and-error with the `size`

and `alpha`

arguments of `geom_jitter()`

produces the following:

```
ggplot(acs, aes(x = race, y = edu)) +
geom_jitter(size = .25, alpha = .1)
```

Now, this plot is created from the same data as the contingency tables above, but we are much better at finding patterns in point density than we are in comparing numbers in a table. For example, we see:

- “White” is the most populous level of
`race`

, and “Asian” and “Other” have much fewer individuals - For all levels of
`race`

, the general pattern is that lower levels of`edu`

are more common, and higher levels less common

With just two lines of code and some experimenting, we can produce plots that help us in the beginning stages of data exploration.

### 5.3.2 Barplots

Another choice to visualize two discrete variables is the barplot.

Instead of making `edu`

the y variable, we can assign it to the `fill`

aesthetic, which `geom_bar()`

uses to color the bars.

Setting the `position`

argument of `geom_bar()`

to `"dodge"`

places the bars side by side.

```
ggplot(acs, aes(x = race, fill = edu)) +
geom_bar(position = "dodge")
```

Recall that `edu`

is a factor vector (ordered categorical variable), while `race`

is a character vector (unordered categorical variable). As such, the levels of `edu`

follow the order we supplied, while `race`

defaults to alphabetical If we wanted to change this, we could simply make `race`

a factor and specify its levels with `fct_relevel()`

as we did earlier.

Instead of placing the bars next to each other, we can stack them by setting `position`

to `"stack"`

.

```
ggplot(acs, aes(x = race, fill = edu)) +
geom_bar(position = "stack")
```

The default statistic of the y-axis when using `geom_bar()`

is count. The representation of the levels of `edu`

are difficult to interpret for `Asian`

and `Other`

. To correct this, we can change the y-axis to be the proportion of counts within each `race`

. To do so, set `position`

to `"fill"`

. Although the y-axis is still labeled “count”, but we can tell by its scale that it is a proportion. `position = "fill"`

is a standardized version of `position = "stack"`

, where count bars are stacked and then standardized to have the same height. In this way, the height of each `edu`

bar within each level of `race`

represents the proportion of cases for each `race`

with each level of `edu`

.

(To change the y-axis label, see the section Adding and Changing Labels.)

```
ggplot(acs, aes(x = race, fill = edu)) +
geom_bar(position = "fill")
```

Our problem here is that we are relying on colors to communicate information. Colors are automatically chosen to be evenly spaced around the color space to be distinguishable, but it can be difficult to distinguish them as the number of categories increase. Also, colors can cause difficulties for people with colorblindness, and colors often do not have the same level of contrast when they are printed in grayscale (and some colors are indistinguishable when printed in grayscale).

All that to say, use colors as you wish for personal data visualization, but whenever you produce plots for colleagues or for publication, it is best to avoid colors.

With a little bit of data wrangling (see Data Wrangling with R), we can calculate the percent of each `race`

who have each level of `edu`

rather than having ggplot calculate this with the `fill`

aesthetic. ggplot has some capabilities for calculating summary statistics for us (e.g., counts, proportions), but it is very limited in this regard. It is better to calculate what we want to plot outside of ggplot, pass these calculated values to `ggplot()`

, and then plot them as-is. Our eventual goal is to create a plot that separates each of the `edu`

bars and aligns them to facilitate visual comparison. We first need to run a few calculations to end up with a dataframe with one observation for each combination of `race`

and `edu`

.

The pipe below does the following:

`group_by()`

specifies our grouping variables:`race`

and`edu`

.`summarize()`

then creates a dataset with one observation for each unique combination of`race`

and`edu`

, and a variable`n`

which contains the number of observations (`n()`

) per combination.- Using
`summarize()`

also peels back one layer of our grouping variables, starting from the end, so now data are only grouped by`race`

`mutate()`

creates a new variable`perc`

. It is defined as`n`

, the number of observations in each`race`

-`edu`

combination, divided by`sum(n)`

, the total number of observations for each level of`race`

, our current and only grouping variable. Multiplying this by 100 changes it from a proportion to a percentage.

```
acs %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n))
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`

```
## # A tibble: 30 x 4
## # Groups: race [5]
## race edu n perc
## <chr> <fct> <int> <dbl>
## 1 Asian Less than High School 340 36.3
## 2 Asian High School 112 12.0
## 3 Asian Some College 170 18.2
## 4 Asian Bachelors 165 17.6
## 5 Asian Advanced Degree 107 11.4
## 6 Asian <NA> 42 4.49
## 7 Black Less than High School 1459 46.4
## 8 Black High School 645 20.5
## 9 Black Some College 584 18.6
## 10 Black Bachelors 228 7.25
## # ... with 20 more rows
```

After this wrangling, we pipe the resulting dataframe into `ggplot()`

. We can specify `race`

as our x variable and `perc`

(the percentage of each `race`

with each level of `edu`

) as our y variable.

Then, as with earlier, we need to specify that we do not want a count of observations for our x variable by setting `stat = "identity"`

. Then, stack the barplots.

```
acs %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n)) %>%
ggplot(aes(x = race, y = perc)) +
geom_bar(stat = "identity")
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`

Note that, at this point, this plot contains mostly the same information as the colorful plot we produced above when we specified `fill = edu`

and `position = "fill"`

, with a few exceptions. The scale of the y-axis is 0-100 instead of 0-1, the `edu`

bars are not colored and are separated with thin gray lines, and the levels of `edu`

are in the opposite order. (If we look at “Asian”, the largest bar is at the bottom rather than at the top.)

However, one vital piece of information is not included in this plot, which was communicated by color in a previous plot. Which bars correspond to which levels of `edu`

?

We can communicate this information is by separating the levels of `edu`

and adding labels by using `facet_grid()`

. Faceting allows us to split our one plot into several panels. (See the section With Facets.) Specifying the formula `~ edu`

means that the plots will be split along the education variable.

Then, flipping the coordinates will arrange our barplots horizontally.

```
acs %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n)) %>%
ggplot(aes(x = race, y = perc)) +
geom_bar(stat = "identity") +
facet_grid(~ edu) +
coord_flip()
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`

One advantage of this plot over the colorful stacked barplot is that we can easily compare the proportions within each level of `edu`

. Vertically scan the “Some College” facet, and you can see which levels of `race`

have greater or lesser proportions in this category (“White” has the most, “Hispanic” the least, and the other three are somewhat tied). However, if we try to compare the green “Some College” bars in our stacked barplot, it is much more difficult to compare. Separating the bars into facets facilitates this type of between-`race`

, within-`edu`

comparison.

We can also add text with the percentage shown by each bar. As with before, we need to adjust the y limits to make room for our labels, and then we need to use `hjust`

to move the text to a better place.

Note how we can use `round()`

within our ggplot call to round `perc`

to the nearest tenth.

```
acs %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n)) %>%
ggplot(aes(x = race, y = perc)) +
geom_bar(stat = "identity") +
facet_grid(~ edu) +
coord_flip() +
ylim(c(0, 90)) +
geom_text(aes(label = round(perc, 1)), hjust = -.1)
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`

In this case, since our bars are separated and the column names give information about the `edu`

variable, we can add color for strictly aesthetic purposes.

Specify this with `fill = edu`

.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n)) %>%
ggplot(aes(x = race, y = perc, fill = edu)) +
geom_bar(stat = "identity") +
facet_grid(~ edu) +
coord_flip() +
ylim(c(0, 90)) +
geom_text(aes(label = round(perc, 1)), hjust = -.1)
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`

Since color does not give any unique information, add `show.legend = F`

to `geom_bar()`

to suppress the legend, since it does not add any information.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n)) %>%
ggplot(aes(x = race, y = perc, fill = edu)) +
geom_bar(stat = "identity", show.legend = F) +
facet_grid(~ edu) +
coord_flip() +
ylim(c(0, 90)) +
geom_text(aes(label = round(perc, 1)), hjust = -.1)
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`

One issue in the plot is that our facet names are a little long, and “Less than High School” is being cut off. We could experiment with text size, or we can use the `labeller`

argument in our `facet_grid()`

function and specify the maximum number of characters before the line wraps.

While we are at it, we can also increase the text size of the facet text for emphasis with the `strip.text`

argument in `theme()`

.

```
acs %>%
filter(!is.na(edu)) %>%
group_by(race, edu) %>%
summarize(n = n()) %>%
mutate(perc = 100*n/sum(n)) %>%
ggplot(aes(x = race, y = perc, fill = edu)) +
geom_bar(stat = "identity", show.legend = F) +
facet_grid(~ edu, labeller = label_wrap_gen(width=10)) +
coord_flip() +
ylim(c(0, 90)) +
geom_text(aes(label = round(perc, 1)), hjust = -.1) +
theme(strip.text = element_text(size = 12))
```

`## `summarise()` has grouped output by 'race'. You can override using the `.groups` argument.`