SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

3.3 Relationships between continuous and categorical variables

3.3.1 Data concepts

3.3.1.1 Categorical variable

A categorical variable can take on a finite set of values. The simplest form of categorical variable is an indicator variable, which has only two values. The two values are typically 0 and 1, although other values are used at times. Other categorical variables take on more than two values. These values are often expressed using descriptive character strings. For example, a categorical variable for the rank of a professor might use assistant professor, associate professor, and professor as its values. The values of a categorical variable are sometimes referred to as levels.
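
As a small illustration of levels and indicator variables, the sketch below builds a categorical variable in pandas from the professor ranks mentioned above. The sample values and the names rank and is_full are made up for illustration.

```python
import pandas as pd

# Hypothetical sample of professor ranks (illustrative values only)
rank = pd.Series(
    ["assistant professor", "professor", "associate professor", "professor"],
    dtype="category",
)

# The levels are the distinct values the variable can take
levels = list(rank.cat.categories)
print(levels)

# An indicator variable has only two values, here 0 and 1
is_full = (rank == "professor").astype(int)
print(is_full.tolist())
```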

Observations within a category are often more similar to each other than they are to observations in other categories. These two kinds of difference are sometimes referred to as within group and between groups variation. For example, we would expect the salaries of the assistant professor group to be fairly similar to each other, and to generally be different from the salaries in the professor group. These are the kinds of relationships that can be explored with graphs.
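
The salary example can be sketched numerically. The salaries below are invented for illustration. The standard deviation inside each rank is one possible measure of within group variation, and the gap between the group means is one possible measure of between groups variation.

```python
import pandas as pd

# Hypothetical salaries in thousands of dollars (made-up values)
salaries = pd.DataFrame({
    "rank": ["assistant"] * 3 + ["professor"] * 3,
    "salary": [70, 72, 74, 118, 120, 122],
})

# Within group variation: spread of salaries inside each rank
within_sd = salaries.groupby("rank")["salary"].std()

# Between groups variation: how far apart the group means are
group_means = salaries.groupby("rank")["salary"].mean()
mean_gap = group_means["professor"] - group_means["assistant"]

print(within_sd)
print(mean_gap)
```

In this made-up data the spread within each rank is small compared with the gap between the two ranks.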

3.3.2 Exploring - Box plots

A box plot is a graph of the distribution of a continuous variable. The graph is based on the quartiles of the variable. The quartiles divide a set of ordered values into four groups with the same number of observations. The smallest values are in the first quartile and the largest values are in the fourth quartile.
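
A minimal sketch of the four quartile groups, using made-up values and the pandas qcut() function. The names values and quartile_group are illustrative only.

```python
import pandas as pd

# Hypothetical ordered values (made up for illustration)
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])

# qcut with q=4 assigns each value to one of four equal-sized groups,
# labeled 1 (smallest values) through 4 (largest values)
quartile_group = pd.qcut(values, q=4, labels=[1, 2, 3, 4])

# Count the observations that fall in each group
group_sizes = quartile_group.value_counts().sort_index()
print(group_sizes)
```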

The plot uses a box to show the values in the second and third quartiles, the middle half of the data. These are the values that are closest to the center (median) of the values, and the median is usually drawn as a line inside the box. The values in the first and fourth quartiles are shown as lines extending from each end of the box. These lines are referred to as whiskers. These are the values that are farthest from the center of the values. Many plotting tools draw unusually extreme values as individual points beyond the whiskers.
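
The numbers behind a box plot can be computed directly. The sketch below uses made-up values and assumes a common convention in which whiskers stop at the most extreme values within 1.5 times the interquartile range of the box. Plotting tools may compute quartiles and whiskers somewhat differently.

```python
import numpy as np

# Hypothetical values (made up for illustration)
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 30])

# Edges of the box (q1, q3) and the line drawn inside it (median)
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Assumed convention: whiskers reach the most extreme values that lie
# within 1.5 * IQR of the box; values beyond that are drawn as points
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = values[(values >= lower_limit) & (values <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(q1, median, q3)
print(whisker_low, whisker_high, outliers)
```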

One useful way to explore the relationship between a continuous and a categorical variable is with a set of side by side box plots, one for each of the categories. Similarities and differences between the category levels can be seen in the length and position of the boxes and whiskers.

3.3.3 Examples - R

These examples use the auto.csv data set.

  1. We begin with code similar to that in the prior section to load the tidyverse and import the csv file.

    library(tidyverse)

    A categorical variable is needed for these examples. The col_types parameter of read_csv() is used to create a factor variable, which is R's name for a categorical variable. Factor variables in R will be covered in a future chapter. For now, all you need to know is that the origin variable can be used as a categorical variable.

    auto_path <- file.path("..", "datasets", "auto.csv")
    auto <- read_csv(auto_path, col_types = cols(origin = col_factor(NULL)))
    Warning: Missing column names filled in: 'X1' [1]
    glimpse(auto)
    Observations: 392
    Variables: 10
    $ X1           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 1...
    $ cylinders    <dbl> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6...
    $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390,...
    $ horsepower   <dbl> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190,...
    $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4...
    $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10....
    $ year         <dbl> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7...
    $ origin       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1...
    $ name         <chr> "chevrolet chevelle malibu", "buick skylark 320",...

3.3.3.1 Exploring - Box plots

  1. This example uses origin as the horizontal variable for a boxplot. This results in the creation of a separate boxplot for each level of the origin variable. All the observations with a value of 1 are used in the leftmost boxplot. Similarly, the observations for levels 2 and 3 of origin are used in separate boxplots.

    ggplot(data = auto, mapping = aes(x = origin, y = mpg)) +
      geom_boxplot() +
      theme_bw()

    The above box plot shows that the distribution of mpg values differs across the three levels of origin. The automobiles at level 1 have a lower median value than those at the other two levels. The lowest mpg for level 3 is about the same as the median of level 1.

3.3.4 Examples - Python

These examples use the auto.csv data set.

  1. We begin with code similar to that in the prior section to load the packages and import the csv file.

    from pathlib import Path
    import pandas as pd
    import plotnine as p9

    A categorical variable is needed for these examples. The dtype parameter of read_csv() is used to create a category variable, which is pandas' name for a categorical variable. Category variables will be covered in a future chapter. For now, all you need to know is that the origin variable can be used as a categorical variable.

    auto_path = Path('..') / 'datasets' / 'Auto.csv'
    auto = pd.read_csv(auto_path, dtype={'origin': 'category'})
    
    print(auto.dtypes)
    Unnamed: 0         int64
    mpg              float64
    cylinders          int64
    displacement     float64
    horsepower         int64
    weight             int64
    acceleration     float64
    year               int64
    origin          category
    name              object
    dtype: object

3.3.4.1 Exploring - Box plots

  1. This example uses origin as the horizontal variable for a boxplot. This results in the creation of a separate boxplot for each level of the origin variable. All the observations with a value of 1 are used in the leftmost boxplot. Similarly, the observations for levels 2 and 3 of origin are used in separate boxplots.

    print(
        p9.ggplot(auto, p9.aes(x='origin', y='mpg')) + 
        p9.geom_boxplot() +
        p9.theme_bw())
    <ggplot: (-9223371893262612655)>

    The above box plot shows that the distribution of mpg values differs across the three levels of origin. The automobiles at level 1 have a lower median value than those at the other two levels. The lowest mpg for level 3 is about the same as the median of level 1.
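
A numeric summary by group can complement what is read off a box plot. The sketch below uses a small made-up data frame named toy as a stand-in for the auto data, so its values are illustrative and are not taken from auto.csv. The same groupby() call could be applied to a data frame with origin and mpg columns.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the auto data (values made up)
toy = pd.DataFrame({
    "origin": pd.Categorical(["1", "1", "1", "2", "2", "3", "3"]),
    "mpg": [15.0, 18.0, 21.0, 24.0, 28.0, 30.0, 34.0],
})

# describe() reports the minimum, quartiles, median (50%), and maximum
# of mpg within each level of origin
summary = toy.groupby("origin", observed=True)["mpg"].describe()
print(summary[["min", "25%", "50%", "75%", "max"]])
```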

3.3.5 Exercises

These exercises use the Mroz.csv data set that was imported in the prior section.

  1. Create a boxplot for lwg for women who attended college and women who did not.

  2. Create a boxplot for lwg for men who attended college and men who did not.