 Supporting Statistical Analysis for Research

## 5.4 Factors and Indicators

### 5.4.1 Data concepts

#### 5.4.1.1 Indicator variables

An indicator variable is a categorical variable that has exactly two levels. Logical variables are one example of indicator variables.

Indicator variables are an important class of variables because many analyses require a factor variable to be converted to a set of indicator variables. Indicator variables often use the values 0 and 1 for the two levels, but not always.
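As a minimal sketch of this conversion, the pandas snippet below builds one 0/1 indicator per level of a small factor. The `size` values and the `size_` column names are made up for illustration and are not part of any data set used in this chapter.

```python
import pandas as pd

# Hypothetical three-level factor, invented for this illustration.
size = pd.Series(["small", "large", "medium", "small"], dtype="category")

# One indicator per level: 1 where the observation has that level, else 0.
indicators = pd.DataFrame(
    {f"size_{level}": (size == level).astype(int)
     for level in size.cat.categories}
)

print(indicators)
```

Each row has exactly one indicator equal to 1. A logical (True/False) column is the same idea without the conversion to integers.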

#### 5.4.1.2 Factor variables from numeric variables

Numeric variables can be converted to a factor variable by collapsing values that fall within a set of intervals. This is a form of data reduction, and data reductions typically do not improve an analysis. On the other hand, converting a numeric variable to a factor can sometimes make it much easier to see patterns in data during exploration. For example, converting a numeric variable to high, medium, and low intervals allows the variable to be used in facets to see if there are visual differences in a plot.
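The idea can be sketched with a few invented measurements. The `dose` and `response` columns, the break points, and the level names below are all assumptions chosen for illustration. The numeric variable is collapsed into three intervals and the response is then summarized within each interval.

```python
import pandas as pd

# Hypothetical measurements, invented for this illustration.
df = pd.DataFrame({
    "dose": [1.0, 2.5, 4.0, 5.5, 7.0, 9.0],
    "response": [10, 12, 20, 22, 35, 40],
})

# Collapse dose into three intervals: (0, 3], (3, 6], and (6, 10].
df["dose_level"] = pd.cut(
    df["dose"],
    bins=[0, 3, 6, 10],
    labels=["low", "medium", "high"],
)

# Summarize the response within each interval.
level_means = df.groupby("dose_level", observed=True)["response"].mean()
print(level_means)
```

The grouped means make the trend easy to see, at the cost of discarding the variation in dose within each interval.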

### 5.4.2 Examples - R

These examples use the Forbes2000.csv data set.

1. We begin by loading the tidyverse, importing the csv file, and naming variables.

library(tidyverse)
forbes_path <- file.path("..", "datasets", "Forbes2000.csv")
forbes_in <- read_csv(forbes_path, col_types = cols())
Warning: Missing column names filled in: 'X1' 
forbes_in <- rename(forbes_in, market_value = marketvalue)

forbes <-
  forbes_in %>%
  select(-X1)

glimpse(forbes)
Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <chr> "Banking", "Conglomerates", "Insurance", "Oil & g...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

2. Make the category variable a factor variable.

The factor() function can be used to convert variables to factor variables. This is a base R function and it works well with the tidyverse.

forbes <-
  forbes %>%
  mutate(
    category = factor(category)
  )

glimpse(forbes)
Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

3. Factor category using parse_factor().

The tidyverse function parse_factor() will convert a variable to a factor variable. This tidyverse function does additional checks that are not done in the base R function factor(). For example, parse_factor() produces warnings for values that do not match the expected levels.

We start the example by creating the set of levels to use in creating the factor variable.

forbes <- mutate(forbes, category = as.character(category))

category_lev <-
  forbes %>%
  select(category) %>%
  distinct(category) %>%
  arrange(category) %>%
  pull()

head(category_lev)
[1] "Aerospace & defense"          "Banking"
[3] "Business services & supplies" "Capital goods"
[5] "Chemicals"                    "Conglomerates"

The levels are now used to create the factor variable.

forbes <-
  forbes %>%
  mutate(
    category = parse_factor(category, levels = category_lev)
  )

glimpse(forbes)
Observations: 2,000
Variables: 8
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...

4. Create a factor variable from a numeric variable.

The cut() function from base R provides a means to specify a flexible set of interval ranges. The intervals are specified as a set of break points that will be used as lower and upper end points. The names of the intervals can be set using the labels parameter.

forbes <-
  forbes %>%
  mutate(
    profit_lev = cut(
      profits,
      breaks = c(-Inf, .08, .44, 10, Inf),
      labels = c("low", "mid", "high", "very high")
    )
  )

glimpse(forbes)
Observations: 2,000
Variables: 9
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
$ profit_lev   <fct> very high, very high, high, very high, very high,...

The tidyverse has the cut_interval(), cut_number(), and cut_width() functions. These functions divide the values into equal segments, measured by either width or count of observations.

5. Create an indicator variable to identify NAFTA countries.

The %in% operator is used to determine, for each value on the left, whether it is in the set of values on the right.

forbes <-
  forbes %>%
  mutate(
    nafta = country %in% c("United States", "Canada", "Mexico")
  )

glimpse(forbes)
Observations: 2,000
Variables: 10
$ rank         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15...
$ name         <chr> "Citigroup", "General Electric", "American Intl G...
$ country      <chr> "United States", "United States", "United States"...
$ category     <fct> Banking, Conglomerates, Insurance, Oil & gas oper...
$ sales        <dbl> 94.71, 134.19, 76.66, 222.88, 232.57, 49.01, 44.3...
$ profits      <dbl> 17.85, 15.59, 6.46, 20.96, 10.27, 10.81, 6.66, 7....
$ assets       <dbl> 1264.03, 626.93, 647.66, 166.99, 177.57, 736.45, ...
$ market_value <dbl> 255.30, 328.54, 194.87, 277.02, 173.54, 117.55, 1...
$ profit_lev   <fct> very high, very high, high, very high, very high,...
$ nafta        <lgl> TRUE, TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE...

### 5.4.3 Examples - Python

These examples use the Forbes2000.csv data set.

1. We begin by loading the packages, importing the csv file, and naming the variables.

from pathlib import Path
import pandas as pd
import numpy as np

forbes_path = Path('..') / 'datasets' / 'Forbes2000.csv'
# Read the csv file; this step is needed before the rename below.
forbes_in = pd.read_csv(forbes_path)
forbes_in = (
    forbes_in
        .rename(columns={'marketvalue': 'market_value'}))
forbes = forbes_in.copy(deep=True)

print(forbes.dtypes)
Unnamed: 0        int64
rank              int64
name             object
country          object
category         object
sales           float64
profits         float64
assets          float64
market_value    float64
dtype: object

2. Make the category variable a factor variable.

The .astype('category') method can be used to convert variables to category variables.

forbes = forbes.assign(category = lambda df: df['category'].astype('category'))

print(forbes['category'].head())
0                 Banking
1           Conglomerates
2               Insurance
3    Oil & gas operations
4    Oil & gas operations
Name: category, dtype: category
Categories (27, object): [Aerospace & defense, Banking, Business services & supplies, Capital goods, ...,
Telecommunications services, Trading companies, Transportation, Utilities]
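To see what the conversion stores, the short sketch below uses a few invented values (not the Forbes data) and inspects the result through the `.cat` accessor.

```python
import pandas as pd

# Hypothetical values, invented for this illustration.
category = pd.Series(["Banking", "Media", "Banking", "Utilities"])
category = category.astype("category")

# The levels are stored once, in sorted order.
print(category.cat.categories.tolist())

# Each observation is stored as an integer code pointing at its level.
print(category.cat.codes.tolist())
```

Because the levels are inferred from the data, every observed value becomes a level, including any misspelled ones.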

3. Factor category using pd.Categorical().

The pandas function pd.Categorical() will convert a variable to a category variable. This pandas function does additional checks that are not done in .astype('category'). For example, pd.Categorical() will produce np.nan for values that do not match the expected levels.
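A small sketch of this behavior, using invented values with one deliberate misspelling (these are not taken from the Forbes data):

```python
import pandas as pd

# Hypothetical values; "Bankng" is a deliberate misspelling.
values = ["Banking", "Insurance", "Bankng"]
levels = ["Banking", "Insurance"]

# Values not listed in categories become missing rather than raising an error.
cat = pd.Categorical(values, categories=levels)

print(cat)
print(pd.isna(cat))
```

Counting the missing values after the conversion is one way to catch unexpected levels.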

We start the example by creating the set of levels to use in creating the factor variable.

The unique() method returns a NumPy array, which needs to be converted to a pandas object before pandas methods can be used.

forbes = forbes_in

category_lev = pd.Series(forbes['category'].unique()).sort_values()

print(category_lev.head())
19             Aerospace & defense
0                          Banking
21                   Capital goods
18                       Chemicals
dtype: object

The levels are now used to create the factor variable.

forbes = (
    forbes
        .assign(
            category = lambda df:
                pd.Categorical(df['category'], categories=category_lev)))

print(forbes['category'].head())
0                 Banking
1           Conglomerates
2               Insurance
3    Oil & gas operations
4    Oil & gas operations
Name: category, dtype: category
Categories (27, object): [Aerospace & defense, Banking, Business services & supplies, Capital goods, ...,
Telecommunications services, Trading companies, Transportation, Utilities]

The pd.Categorical() constructor has an ordered parameter that can be set to True for an ordered categorical variable.
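As a rough illustration of what ordering adds, the sketch below uses invented ratings (the values and level names are assumptions) and compares them against a level.

```python
import pandas as pd

# Hypothetical ratings, invented for this illustration.
ratings = pd.Series(
    pd.Categorical(
        ["mid", "low", "high", "mid"],
        categories=["low", "mid", "high"],
        ordered=True,
    )
)

# Ordered categories support comparisons and min/max in level order.
above_low = ratings > "low"
print(above_low.tolist())
print(ratings.max())
```

The order comes from the categories argument, not from alphabetical order. Without ordered=True, the comparison with "low" would raise an error.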

4. Create a factor variable from a numeric variable.

The pd.cut() function from pandas provides a means to specify a flexible set of interval ranges. The intervals can be specified as a set of break points that will be used as lower and upper end points. The names of the intervals can be set using the labels parameter.

forbes = (
    forbes
        .assign(
            profit_lev = lambda df:
                pd.cut(
                    df['profits'],
                    bins=[-np.inf, .08, .44, 10, np.inf],
                    labels=['low', 'mid', 'high', 'very high'])))

print(forbes['profit_lev'].head())
0    very high
1    very high
2         high
3    very high
4    very high
Name: profit_lev, dtype: category
Categories (4, object): [low < mid < high < very high]
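It can help to check which interval a value receives when it falls exactly on a break point. The sketch below reuses the break points and labels from this example on a few invented profit values, and compares the default behavior with the right=False option of pd.cut().

```python
import numpy as np
import pandas as pd

# Hypothetical profits chosen to sit on and around the break points.
profits = pd.Series([0.08, 0.09, 0.44, 10.0, 10.01])
bins = [-np.inf, 0.08, 0.44, 10, np.inf]
labels = ["low", "mid", "high", "very high"]

# Default (right=True): each interval includes its upper end point.
right_closed = pd.cut(profits, bins=bins, labels=labels)

# right=False: each interval includes its lower end point instead.
left_closed = pd.cut(profits, bins=bins, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

With the default, a profit of exactly 0.08 is labeled low. With right=False it is labeled mid.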

5. Create an indicator variable to identify NAFTA countries.

The isin() method is used to determine if the values of the object are in the list provided as a parameter to isin().

forbes = (
    forbes
        .assign(
            nafta = lambda df: df['country']
                .isin(['United States', 'Canada', 'Mexico'])))

print(forbes['nafta'].head())
0     True
1     True
2     True
3     True
4    False
Name: nafta, dtype: bool
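A boolean indicator such as this one can be summarized or recoded directly. The sketch below uses a short invented list of countries rather than the Forbes data.

```python
import pandas as pd

# Hypothetical country values, invented for this illustration.
country = pd.Series(["United States", "Japan", "Canada", "Germany", "Mexico"])

nafta = country.isin(["United States", "Canada", "Mexico"])

# True counts as 1 and False as 0, so sum() counts and mean() gives a share.
n_nafta = nafta.sum()
share_nafta = nafta.mean()

# astype(int) gives the 0/1 coding that some modeling tools expect.
nafta_01 = nafta.astype(int)

print(n_nafta, share_nafta)
print(nafta_01.tolist())
```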

6. Create indicator variables from category variables.

Pandas provides the get_dummies() function to convert categorical variables to sets of indicator variables.

forbes_dum = pd.get_dummies(forbes, columns=['category'])

print(forbes_dum.dtypes)
Unnamed: 0                                      int64
rank                                            int64
name                                           object
country                                        object
sales                                         float64
profits                                       float64
assets                                        float64
market_value                                  float64
profit_lev                                   category
nafta                                            bool
category_Aerospace & defense                    uint8
category_Banking                                uint8
category_Capital goods                          uint8
category_Chemicals                              uint8
category_Conglomerates                          uint8
category_Construction                           uint8
category_Consumer durables                      uint8
category_Diversified financials                 uint8
category_Drugs & biotechnology                  uint8
category_Food drink & tobacco                   uint8
category_Food markets                           uint8
category_Health care equipment & services       uint8
category_Hotels restaurants & leisure           uint8
category_Household & personal products          uint8
category_Insurance                              uint8
category_Materials                              uint8
category_Media                                  uint8
category_Oil & gas operations                   uint8
category_Retailing                              uint8
category_Semiconductors                         uint8
category_Software & services                    uint8
category_Technology hardware & equipment        uint8
category_Telecommunications services            uint8
category_Transportation                         uint8
category_Utilities                              uint8
dtype: object
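Two optional parameters of pd.get_dummies() are often useful when the indicators feed a model. The sketch below uses a small invented data frame (the `firm` and `sector` columns are assumptions, not Forbes variables). The default dtype of the indicator columns has differed across pandas versions (the output above shows uint8, while recent releases return bool), so dtype=int is passed explicitly here.

```python
import pandas as pd

# Hypothetical data, invented for this illustration.
df = pd.DataFrame({
    "firm": ["A", "B", "C", "D"],
    "sector": pd.Categorical(["Banking", "Media", "Utilities", "Media"]),
})

# drop_first=True omits the first level ("Banking"), which becomes the
# reference level; dtype=int requests 0/1 integers.
dummies = pd.get_dummies(df, columns=["sector"], drop_first=True, dtype=int)

print(dummies)
```

Dropping one level avoids a set of indicators that always sums to one, which matters for models that also include an intercept.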

### 5.4.4 Exercises

These exercises use the mtcars.csv data set.

1. Import the mtcars.csv data set.

2. Factor the cyl, gear and carb variables.

3. Create a variable that identifies the observations that are in the top 25 percent of miles per gallon. Display a few of these vehicles.

Hint: you will need to find a function that identifies the percentage points of a variable.

4. Create a variable that bins the values of hp using the following amounts of hp: 100, 170, 240, and 300.