SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.4 Factors and Indicators

5.4.1 Data concepts

5.4.1.1 Indicator variables

An indicator variable is a categorical variable that has exactly two levels. Logical variables are one example of indicator variables.

These are an important class of variables because many analyses require a factor variable to be converted to a set of indicator variables. Indicator variables often use the values 0 and 1 for the two levels, but not always.
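
As a minimal sketch of the different codings an indicator can take, the Python snippet below builds the same two-level variable three ways. The data frame `firms` and its column names are hypothetical and are not taken from the Forbes2000.csv data set.

```python
import pandas as pd

# Hypothetical toy data, not taken from Forbes2000.csv.
firms = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'sector': ['Banking', 'Media', 'Banking', 'Utilities'],
})

# A logical (boolean) indicator: True where the firm is a bank.
firms['is_bank'] = firms['sector'] == 'Banking'

# The same indicator coded as 0 and 1.
firms['is_bank_01'] = firms['is_bank'].astype(int)

# The same indicator with two text levels instead of 0 and 1.
firms['bank_label'] = firms['is_bank'].map({True: 'bank', False: 'other'})

print(firms)
```

All three columns carry the same information. Which coding is most convenient likely depends on what the downstream analysis expects.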

5.4.1.2 Factor variables from numeric variables

Numeric variables can be converted to a factor variable by collapsing values that fall within a set of intervals. This is a form of data reduction. Data reductions typically do not improve an analysis. On the other hand, converting a numeric variable to a factor can sometimes make it much easier to see patterns in data during exploration. For example, converting a numeric variable to high, medium, and low intervals allows the variable to be used in facets to see if there are visual differences in a plot.
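
The idea can be sketched in Python with made-up numbers. Here `pd.qcut()` is assumed as the binning tool because it splits at quantiles. The data frame `toy`, its values, and the level names are hypothetical. The grouped mean stands in for the kind of comparison a faceted plot would show.

```python
import pandas as pd

# Hypothetical toy data: nine firms with made-up sales and profits.
toy = pd.DataFrame({
    'sales': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'profits': [0.1, 0.3, 0.2, 0.5, 0.4, 0.6, 1.2, 0.9, 1.5],
})

# pd.qcut() splits at quantiles, so each of the three levels
# receives roughly the same number of observations.
toy['sales_lev'] = pd.qcut(toy['sales'], q=3, labels=['low', 'medium', 'high'])

# Summarize profits within each level to look for a pattern.
summary = toy.groupby('sales_lev', observed=True)['profits'].mean()
print(summary)
```

The binned variable discards the exact sales figures, which is the data reduction described above, but the three group means are easy to compare at a glance.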

5.4.2 Examples - R

These examples use the Forbes2000.csv data set.

  1. We begin by loading the tidyverse, importing the csv file, and naming variables.

    library(tidyverse)
    forbes_path <- file.path("..", "datasets", "Forbes2000.csv")
    forbes_in <- read_csv(forbes_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    forbes_in <- rename(forbes_in, market_value = marketvalue)
    
    forbes <- 
      forbes_in %>%
      select(-X1)
    
    glimpse(forbes)
    Observations: 2,000
    Variables: 8
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ name         <chr> "Citigroup", "General Electric", "American Intl G...
    $ country      <chr> "United States", "United States", "United States"...
    $ category     <chr> "Banking", "Conglomerates", "Insurance", "Oil & g...
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

  2. Make the category variable a factor variable.

    The factor() function can be used to convert variables to factor variables. This is a base R function and it works well with the tidyverse.

    forbes <-
      forbes %>%
      mutate(
        category = factor(category)
      )
    
    glimpse(forbes)
    Observations: 2,000
    Variables: 8
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ name         <chr> "Citigroup", "General Electric", "American Intl G...
    $ country      <chr> "United States", "United States", "United States"...
    $ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

  3. Factor category using parse_factor().

    The tidyverse function parse_factor() will convert a variable to a factor variable. This tidyverse function does additional checks that are not done in the base R function factor(). For example, parse_factor() produces warnings for values that do not match the expected levels.

    We start the example by creating the set of levels to use in creating the factor variable.

    forbes <- mutate(forbes, category = as.character(category))
    
    category_lev <-
      forbes %>%
      select(category) %>%
      distinct(category) %>%
      arrange(category) %>%
      pull()
    
    head(category_lev)
    [1] "Aerospace & defense"          "Banking"                     
    [3] "Business services & supplies" "Capital goods"               
    [5] "Chemicals"                    "Conglomerates"               

    The levels are now used to create the factor variable.

    forbes <-
      forbes %>%
      mutate(
        category = parse_factor(category, levels = category_lev)
      )
    
    glimpse(forbes)
    Observations: 2,000
    Variables: 8
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ name         <chr> "Citigroup", "General Electric", "American Intl G...
    $ country      <chr> "United States", "United States", "United States"...
    $ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

  4. Create a factor variable from a numeric variable.

    The cut() function from base R provides a means to specify a flexible set of interval ranges. The intervals are specified as a set of break points that will be used as lower and upper end points. The names of the intervals can be set using the labels parameter.

    forbes <-
      forbes %>%
      mutate(
        profit_lev = cut(profits, 
                         breaks = c(-Inf, .08, .44, 10, Inf),
                         labels = c("low", "mid", "high", "very high")
                         )
      )
    
    glimpse(forbes)
    Observations: 2,000
    Variables: 9
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ name         <chr> "Citigroup", "General Electric", "American Intl G...
    $ country      <chr> "United States", "United States", "United States"...
    $ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
    $ profit_lev   <fct> very high, very high, high, very high, very high,...

    The tidyverse (through ggplot2) has the cut_interval(), cut_number(), and cut_width() functions. These functions divide the values into equal segments, measured by either width or count of observations.

  5. Create an indicator variable to identify NAFTA countries.

    The %in% operator tests, for each value on the left, whether it appears in the set of values on the right. It returns one logical value per element on the left.

    forbes <-
      forbes %>%
      mutate(
        nafta = country %in% c("United States", "Canada", "Mexico")
        )
    
    glimpse(forbes)
    Observations: 2,000
    Variables: 10
    $ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
    $ name         <chr> "Citigroup", "General Electric", "American Intl G...
    $ country      <chr> "United States", "United States", "United States"...
    $ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
    $ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
    $ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
    $ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
    $ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
    $ profit_lev   <fct> very high, very high, high, very high, very high,...
    $ nafta        <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE...

5.4.3 Examples - Python

These examples use the Forbes2000.csv data set.

  1. We begin by loading the packages, importing the csv file, and naming the variables.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv'
    forbes_in = pd.read_csv(forbes_path)
    forbes_in = (
        forbes_in
            .rename(columns={'marketvalue': 'market_value'}))
    forbes =  forbes_in.copy(deep=True)
    
    print(forbes.dtypes)
    Unnamed: 0        int64
    rank              int64
    name             object
    country          object
    category         object
    sales           float64
    profits         float64
    assets          float64
    market_value    float64
    dtype: object

  2. Make the category variable a factor variable.

    The .astype('category') method can be used to convert variables to category variables.

    forbes = forbes.assign(category = lambda df: df['category'].astype('category'))
    
    print(forbes['category'].head())
    0                 Banking
    1           Conglomerates
    2               Insurance
    3    Oil & gas operations
    4    Oil & gas operations
    Name: category, dtype: category
    Categories (27, object): [Aerospace & defense, Banking, Business services & supplies, Capital goods, ...,
                              Telecommunications services, Trading companies, Transportation, Utilities]

  3. Factor category using pd.Categorical().

    The pandas function pd.Categorical() will convert a variable to a category variable. It accepts an explicit set of levels, which .astype('category') on its own does not. For example, pd.Categorical() will produce np.nan for values that do not match the expected levels.

    We start the example by creating the set of levels to use in creating the factor variable.

    The unique() method returns a NumPy array (np.ndarray), which needs to be converted to a pandas object before pandas methods such as sort_values() can be used.

    forbes = forbes_in.copy(deep=True)
    
    category_lev = pd.Series(forbes['category'].unique()).sort_values()
    
    print(category_lev.head())
    19             Aerospace & defense
    0                          Banking
    22    Business services & supplies
    21                   Capital goods
    18                       Chemicals
    dtype: object

    The levels are now used to create the factor variable.

    forbes = (
        forbes
            .assign(
                category = lambda df:
                    pd.Categorical(df['category'], categories=category_lev)))
    
    print(forbes['category'].head())
    0                 Banking
    1           Conglomerates
    2               Insurance
    3    Oil & gas operations
    4    Oil & gas operations
    Name: category, dtype: category
    Categories (27, object): [Aerospace & defense, Banking, Business services & supplies, Capital goods, ...,
                              Telecommunications services, Trading companies, Transportation, Utilities]

    The pd.Categorical() constructor has an ordered parameter that can be set to True for an ordered categorical variable.
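
A short sketch of both behaviors, using made-up values rather than the Forbes data: a value outside the supplied levels becomes missing, and setting ordered=True allows comparisons between levels. The names `sizes` and `size_cat` are hypothetical.

```python
import pandas as pd

# Hypothetical values; 'huge' is deliberately not one of the levels.
sizes = pd.Series(['low', 'high', 'mid', 'huge', 'low'])

size_cat = pd.Series(
    pd.Categorical(sizes, categories=['low', 'mid', 'high'], ordered=True))

# Values outside the supplied categories become missing.
print(size_cat.isna().tolist())

# Ordered categories support comparisons and min()/max().
print((size_cat > 'low').tolist())
print(size_cat.max())
```

Because unmatched values become missing without a warning, counting missing values before and after the conversion is one way to catch misspelled levels.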

  4. Create a factor variable from a numeric variable.

    The pd.cut() function from pandas provides a means to specify a flexible set of interval ranges. The intervals can be specified as a set of break points that will be used as lower and upper end points. The names of the intervals can be set using the labels parameter.

    forbes = (
        forbes
            .assign(
                profit_lev = lambda df:
                    pd.cut(
                        df['profits'],
                        bins=[-np.inf, .08, .44, 10, np.inf],
                        labels=['low', 'mid', 'high', 'very high'])))
    
    print(forbes['profit_lev'].head())
    0    very high
    1    very high
    2         high
    3    very high
    4    very high
    Name: profit_lev, dtype: category
    Categories (4, object): [low < mid < high < very high]

  5. Create an indicator variable to identify NAFTA countries.

    The isin() method is used to determine if the values of the object are in the list provided as a parameter to isin().

    forbes = (
        forbes
            .assign(
                nafta = lambda df: df
                    ['country']
                    .isin(['United States', 'Canada', 'Mexico'])))
    
    print(forbes['nafta'].head())
    0     True
    1     True
    2     True
    3     True
    4    False
    Name: nafta, dtype: bool

  6. Create indicator variables from category variables.

    Pandas provides the get_dummies() function to convert categorical variables to sets of indicator variables.

    forbes_dum = pd.get_dummies(forbes, columns=['category'])
    
    print(forbes_dum.dtypes)
    Unnamed: 0                                      int64
    rank                                            int64
    name                                           object
    country                                        object
    sales                                         float64
    profits                                       float64
    assets                                        float64
    market_value                                  float64
    profit_lev                                   category
    nafta                                            bool
    category_Aerospace & defense                    uint8
    category_Banking                                uint8
    category_Business services & supplies           uint8
    category_Capital goods                          uint8
    category_Chemicals                              uint8
    category_Conglomerates                          uint8
    category_Construction                           uint8
    category_Consumer durables                      uint8
    category_Diversified financials                 uint8
    category_Drugs & biotechnology                  uint8
    category_Food drink & tobacco                   uint8
    category_Food markets                           uint8
    category_Health care equipment & services       uint8
    category_Hotels restaurants & leisure           uint8
    category_Household & personal products          uint8
    category_Insurance                              uint8
    category_Materials                              uint8
    category_Media                                  uint8
    category_Oil & gas operations                   uint8
    category_Retailing                              uint8
    category_Semiconductors                         uint8
    category_Software & services                    uint8
    category_Technology hardware & equipment        uint8
    category_Telecommunications services            uint8
    category_Trading companies                      uint8
    category_Transportation                         uint8
    category_Utilities                              uint8
    dtype: object
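
The listing above shows uint8 columns. More recent pandas releases (2.0 and later) return bool columns by default, so the output may differ by version. The sketch below, on hypothetical toy data, uses the dtype parameter to request 0/1 integers explicitly. It also uses drop_first, which omits one level as a reference. Whether a reference level should be dropped depends on the analysis, so treat that choice as an assumption.

```python
import pandas as pd

# Hypothetical toy data standing in for the Forbes2000 columns.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'category': ['Banking', 'Media', 'Banking', 'Utilities'],
})

# dtype=int requests 0/1 columns regardless of the pandas default.
# drop_first=True omits the first level (Banking) as a reference level.
toy_dum = pd.get_dummies(toy, columns=['category'], dtype=int, drop_first=True)

print(toy_dum.columns.tolist())
print(toy_dum)
```

With three levels and one dropped, two indicator columns remain. A row with 0 in both columns belongs to the omitted Banking level.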

5.4.4 Exercises

These exercises use the mtcars.csv data set.

  1. Import the mtcars.csv data set.

  2. Factor the cyl, gear and carb variables.

  3. Create a variable that identifies the observations that are in the top 25 percent of miles per gallon. Display a few of these vehicles.

    Hint: you will need to find a function to identify the percentage points of a variable.

  4. Create a variable that bins the values of hp using the following amounts of hp: 100, 170, 240, and 300.