# 7 Model Results

If you are starting from this page, please run the code at Libraries and Data Setup before proceeding.

Data visualizations can help us with checking our model assumptions and with understanding what our model means.

Fit a simple model that regresses `income` on `age`, `race`, and `edu` using the `acs_sample` dataset.

Then, add the residuals, both raw and scaled, as columns to the dataframe. `ggplot()` acts on a single dataframe, so it is best to have everything we would like to plot gathered into one dataframe.

``````mod <- lm(income ~ age + race + edu, data = acs_sample)

acs_sample\$mod_res <- residuals(mod)
acs_sample\$mod_res_scaled <- scale(residuals(mod))``````

## 7.1 Residuals

Ordinary least squares (OLS) regression assumed that the model residuals are normally destributed. To assess what that is true of our fitted model, we can use a QQ plot, where “QQ” stands for “quantile quantile.” A QQ plot shows the quantiles of one dataset plotted against the quantiles of another distribution. If the two distributions are the same, the points should fall along a 45-degree line. We can make a QQ plot with `geom_qq()`, which specifies a normal distribution by default (`distribution = stats::qnorm`), and we can pass our residuals to the `sample` argument. Then, we can add a `y = x` line to the plot with `geom_abline()`, and let’s make it red. Deviation from this red line will help us diagnose non-normality.

The `coord_fixed()` function makes the x and y axes’ scales equal, so that our line (and hopefully our residuals) are 45 degrees.

``````ggplot(acs_sample) +
geom_abline(color = "red") +
geom_qq(aes(sample = mod_res_scaled),
size = .5) +
coord_fixed()`````` The points in our QQ plot do not fall along our red 45-degree line, so we have reason to suspect that the residuals are non-normal. The particular deviation of our residuals suggests that our residuals are positively skewed. (For help interpreting QQ plots, check out A Q-Q Plot Dissection kit.) We used income, which tends to be positively skewed, as our outcome, so this violation of the normality assumption makes sense. We might consider transforming our outcome variable and re-running our model.

We can also create a scatterplot of residuals against a predictor to assess linearity (residual mean is zero at every level of x) and homoscedasticity (residual variance is equal at every level of x). We can add a red `geom_hline()` (horizontal line) in addition to a smoothed line with `geom_smooth()`.

``````ggplot(acs_sample, aes(x = age, y = mod_res)) +
geom_hline(yintercept = 0, color = "red") +
geom_point() +
geom_smooth()``````
``## `geom_smooth()` using method = 'loess' and formula 'y ~ x'`` We can also visually identify outliers in our model. There are many outlier metrics and many more rules-of-thumb for each one, so let’s just use Cook’s distance and consider something an outlier if it has a value of more than 4/n.

Add a column that contains the value from `household` if an observation is an outlier, according to our model and residuals. Supply this variable as the `label` aesthetic, which is passed onto `geom_text()` or `geom_label()`.

``````acs_sample %>%
mutate(cooks = cooks.distance(mod),
outlier_household = ifelse(cooks > 4/nrow(.), household, NA)) %>%
ggplot(aes(x = age, y = income, label = outlier_household)) +
geom_point() +
geom_label()``````
``## Warning: Removed 225 rows containing missing values (geom_label).`` ggplot’s `geom_text()` and `geom_label()` functions are centered on the datapoint, and they will overlap if datapoints are close together. Here, this makes our plot hard to read.

To correct this, use the `geom_text_repel()` or `geom_label_repel()` function from the `ggrepel` package to create labels that automatically avoid (i.e, are repelled by) each other. In some cases, a line will connect our label to the datapoint, as seen below with household 584480.

``````acs_sample %>%
mutate(cooks = cooks.distance(mod),
outlier_household = ifelse(cooks > 4/nrow(.), household, NA)) %>%
ggplot(aes(x = age, y = income, label = outlier_household)) +
geom_point() +
geom_label_repel()``````
``## Warning: Removed 225 rows containing missing values (geom_label_repel).`` ## 7.2 Margins Plots

If the residuals do not show a violation of our model assumptions (so let’s pretend we did not see the non-normal residual plot), we can produce margins plots, which show the marginal effect of a predictor, holding other variables constant.

The `ggpredict()` function from the `ggeffects` package creates a nice dataframe with marginal effects.

``ggpredict(mod)``
``````## \$age
## # Predicted values of income
##
## age | Predicted |               95% CI
## --------------------------------------
##  10 |   5204.00 | [-5467.25, 15875.24]
##  20 |   7837.91 | [-1907.05, 17582.87]
##  30 |  10471.83 | [ 1357.73, 19585.93]
##  40 |  13105.74 | [ 4263.61, 21947.87]
##  50 |  15739.66 | [ 6777.87, 24701.45]
##  60 |  18373.57 | [ 8915.35, 27831.79]
##  70 |  21007.49 | [10730.51, 31284.46]
##  90 |  26275.32 | [13666.82, 38883.82]
##
## Adjusted for:
## * race =                 White
## *  edu = Less than High School
##
## \$race
## # Predicted values of income
##
## race     | Predicted |               95% CI
## -------------------------------------------
## White    |  14159.31 | [ 5316.58, 23002.03]
## Black    |   5606.96 | [-7035.53, 18249.45]
## Other    |  17383.83 | [-2662.47, 37430.14]
## Hispanic |   9452.84 | [ -953.17, 19858.86]
## Asian    |  14684.09 | [-8822.55, 38190.74]
##
## Adjusted for:
## * age =                 44.00
## * edu = Less than High School
##
## \$edu
## # Predicted values of income
##
## edu                   | Predicted |               95% CI
## --------------------------------------------------------
## Less than High School |  14159.31 | [ 5316.58, 23002.03]
## High School           |  24520.17 | [18428.06, 30612.28]
## Some College          |  29886.60 | [23212.82, 36560.39]
## Bachelors             |  46146.77 | [37344.65, 54948.88]
## Advanced Degree       |  58790.63 | [45193.58, 72387.69]
##
## Adjusted for:
## *  age = 44.00
## * race = White
##
## attr(,"class")
##  "ggalleffects" "list"
## attr(,"model.name")
##  "mod"``````

To select one term, specify it with the `terms` argument.

``ggpredict(mod, terms = "age")``
``````## # Predicted values of income
##
## age | Predicted |               95% CI
## --------------------------------------
##  10 |   5204.00 | [-5467.25, 15875.24]
##  20 |   7837.91 | [-1907.05, 17582.87]
##  30 |  10471.83 | [ 1357.73, 19585.93]
##  40 |  13105.74 | [ 4263.61, 21947.87]
##  50 |  15739.66 | [ 6777.87, 24701.45]
##  60 |  18373.57 | [ 8915.35, 27831.79]
##  70 |  21007.49 | [10730.51, 31284.46]
##  90 |  26275.32 | [13666.82, 38883.82]
##
## Adjusted for:
## * race =                 White
## *  edu = Less than High School``````

We will need to pick variables by name while plotting, so double-check the names of the columns of this dataframe:

``ggpredict(mod, terms = "age") %>% names()``
``##  "x"         "predicted" "std.error" "conf.low"  "conf.high" "group"``

The names differ from the printed output, so it is good we checked. We can use `x` as our x variable, `predicted` as y, and `conf.low` and `conf.high` as the lower and upper limits of our confidence interval.

To plot the marginal effect of age, we can use `geom_errorbar()`. We saw that `race` was held at “White”, and `edu` at “Less than High School,” so we should add a caption saying this, as well as that our error bars are 95% confidence intervals. Using `\n` in a character string will add a line break.

The `labs()` function allows us to set labels for various elements of the plot, including axes, titles, and captions.

``````ggpredict(mod, terms = "age") %>%
ggplot(aes(x = x, y = predicted)) +
geom_point() +
geom_line() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
labs(x = "age",
y = "predicted income",
caption = "for White individuals with education less than high school\nerror bars represent 95% confidence interval")`````` We can choose to use error bars with barplots as well, being sure to set `stat = "identity"`.

``````ggpredict(mod, terms = "edu") %>%
ggplot(aes(x = x, y = predicted)) +
geom_bar(stat = "identity") +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high)) +
labs(x = "education",
y = "predicted income",
caption = "for White individuals with education less than high school\nerror bars represent 95% confidence interval")`````` Or, we can lose the bars in favor of points, make our error bar limits a little bit narrower with a `width` argument, and add text with the rounded predicted value with `geom_label()`.

``````ggpredict(mod, terms = "edu") %>%
ggplot(aes(x = x, y = predicted, label = round(predicted))) +
geom_point() +
geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = .5) +
geom_label(hjust = -.2) +
labs(x = "education",
y = "predicted income",
caption = "for White individuals with education less than high school\nerror bars represent 95% confidence interval")`````` 