SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.7 Relationships between columns

5.7.1 Data concepts

Columns can contain related data. For example, a patient record may contain a set of indicator variables that indicate different chronic conditions. If you want to know if a patient has no chronic conditions, you would need to test each of the indicator variables.

Tasks that consider multiple related variables often look for a particular condition in the set of variables. In the example above, the common condition was a true value of an indicator variable. Some results of testing for common conditions are: none of the columns meet the particular condition, at least one column meets the particular condition, or all of the column meets all particular condition.

5.7.2 Examples R

These examples use the cps1.csv data set.

  1. We begin by loading the tidyverse, importing the csv file, and naming variables.

    library(tidyverse)
    cps1_path <- file.path("..", "datasets", "cps1.csv")
    cps1_in <- read_csv(cps1_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    cps1_in <- 
      cps1_in %>%
      rename(
        no_deg = nodeg,
        real_earn_74 = re74,
        real_earn_75 = re75,
        real_earn_78 = re78
        )
    cps <-
      cps1_in %>%
      select(-X1)
    
    head(cps)
    # A tibble: 6 x 10
        trt   age  educ black  hisp  marr no_deg real_earn_74 real_earn_75
      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>        <dbl>        <dbl>
    1     0    45    11     0     0     1      1       21517.       25244.
    2     0    21    14     0     0     0      0        3176.        5853.
    3     0    38    12     0     0     1      0       23039.       25131.
    4     0    48     6     0     0     1      1       24994.       25244.
    5     0    18     8     0     0     1      1        1669.       10728.
    6     0    22    11     0     0     1      1       16366.       18449.
    # ... with 1 more variable: real_earn_78 <dbl>
  2. Create an indicator variable to identify someone who had no earnings (unemployed) in any of the years.

    The tidyverse has a set of map() functions that allow you to apply a function over a set of parameters. For example, if you have a function that takes two parameters, and two vectors (variables) that each contain values for one of the parameters, then map would allow you to use each row of these vectors as parameters in a call to the function. A list of the same length as the variables is returned.

    The map() function applies one parameter to the function, map2() applies two vectors as parameters, and pmap() maps a list of parameters to the function.

    Each of the mapping functions have variants of *_if(), *_at(), *_dbl(), etc. These allow controls over which variables to include in a map or what the returned object is. The returned object from *_dbl() is a vector of numeric values.

    This example will use the pmap_lgl() function. The first parameter is the variables to be passed as parameters. This is done using the select() function. The pmap_lgl() function works the same as pmap() except a logical (boolean) vector is returned.

    Note, some functions may need the parameters passed as a vector instead of a list. The lift_vd() function can be used to do this.

    In the relationships between variables considered here, the variables are being tested for a common condition. As such, the variable values are gathered as vector and the test is applied to the vector. The set of parameters into the mapped function are identified with ..., which collects all the parameters together. The any() and all() functions are useful to determine if one or more variables met a condition or all of them do.

    cps <-
      cps %>%
      mutate(
        ever_unemployed = 
          pmap_lgl(select(., real_earn_74:real_earn_78), ~ any(c(...) == 0))
      )
    
    cps %>%
      select(trt, age, real_earn_74:real_earn_78, ever_unemployed) %>%
      group_by(ever_unemployed) %>%
      group_modify(~ head(.x, 3))
    # A tibble: 6 x 6
    # Groups:   ever_unemployed [2]
      ever_unemployed   trt   age real_earn_74 real_earn_75 real_earn_78
      <lgl>           <dbl> <dbl>        <dbl>        <dbl>        <dbl>
    1 FALSE               0    45       21517.       25244.       25565.
    2 FALSE               0    21        3176.        5853.       13496.
    3 FALSE               0    38       23039.       25131.       25565.
    4 TRUE                0    45       25862.           0         3925.
    5 TRUE                0    46       19171.        1318.           0 
    6 TRUE                0    49         392.           0            0 

    Note the use of group_by() and group_modify() to construct a dataframe that shows a few examples from each level of the every_unemployed variable.

    The following alternate implementation uses the min() function to identify the minimum salary earned. If the minimum is 0, then they were unemplyed at least one year.

    cps <-
      cps %>%
      mutate(
        ever_unemployed =
          pmap_dbl(select(., real_earn_74:real_earn_78), ~ min(c(...))) == 0
      )
    
    cps %>%
      select(trt, age, real_earn_74:real_earn_78, ever_unemployed) %>%
      group_by(ever_unemployed) %>%
      group_modify(~ head(.x, 3))
    # A tibble: 6 x 6
    # Groups:   ever_unemployed [2]
      ever_unemployed   trt   age real_earn_74 real_earn_75 real_earn_78
      <lgl>           <dbl> <dbl>        <dbl>        <dbl>        <dbl>
    1 FALSE               0    45       21517.       25244.       25565.
    2 FALSE               0    21        3176.        5853.       13496.
    3 FALSE               0    38       23039.       25131.       25565.
    4 TRUE                0    45       25862.           0         3925.
    5 TRUE                0    46       19171.        1318.           0 
    6 TRUE                0    49         392.           0            0 

    It is best to use the simplest and clearest approach to achive the needed results. Although the use of mathematical functions to achieve logical tests are farily common, they may not be the clearest expression of what is being done. It is best to see if a set of logical expressions can be used and are a clearer expression of what is being done.

  3. Create an indicator variable to identify someone who was never unemployed in any of the years.

    This is a check to see if the earnings for all years is greater than 0.

    cps <-
      cps %>%
      mutate(
        never_unemployed = 
          pmap_lgl(select(., real_earn_74:real_earn_78), ~ all(c(...) > 0))
        )
    
    cps %>%
      select(trt, age, real_earn_74:real_earn_78, never_unemployed) %>%
      group_by(never_unemployed) %>%
      group_modify(~ head(.x, 3))
    # A tibble: 6 x 6
    # Groups:   never_unemployed [2]
      never_unemployed   trt   age real_earn_74 real_earn_75 real_earn_78
      <lgl>            <dbl> <dbl>        <dbl>        <dbl>        <dbl>
    1 FALSE                0    45       25862.           0         3925.
    2 FALSE                0    46       19171.        1318.           0 
    3 FALSE                0    49         392.           0            0 
    4 TRUE                 0    45       21517.       25244.       25565.
    5 TRUE                 0    21        3176.        5853.       13496.
    6 TRUE                 0    38       23039.       25131.       25565.

    The following implementation uses the %in% opperator. The %in% operator returns true if what is on left side in the vector of values on the right side. This can be useful when there is more than one possible matching criteria.

    cps <-
      cps %>%
      mutate(
        never_unemployed = 
          pmap_dbl(select(., real_earn_74:real_earn_78), ~ sum((c(...) %in% 0))) == 0
        )
    
    
    cps %>%
      select(trt, age, real_earn_74:real_earn_78, never_unemployed) %>%
      group_by(never_unemployed) %>%
      group_modify(~ head(.x, 3))
    # A tibble: 6 x 6
    # Groups:   never_unemployed [2]
      never_unemployed   trt   age real_earn_74 real_earn_75 real_earn_78
      <lgl>            <dbl> <dbl>        <dbl>        <dbl>        <dbl>
    1 FALSE                0    45       25862.           0         3925.
    2 FALSE                0    46       19171.        1318.           0 
    3 FALSE                0    49         392.           0            0 
    4 TRUE                 0    45       21517.       25244.       25565.
    5 TRUE                 0    21        3176.        5853.       13496.
    6 TRUE                 0    38       23039.       25131.       25565.

5.7.3 Examples - Python

These examples use the cps1.csv data set.

  1. We begin by loading the packages, importing the csv file, and naming variables.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    cps1_path = Path('..') / 'datasets' / 'cps1.csv'
    cps1_in = pd.read_csv(cps1_path)
    cps1_in = (
        cps1_in.rename(
            columns={
                'nodeg': 'no_deg',
                're74': 'real_earn_74',
                're75': 'real_earn_75',
                're78': 'real_earn_78'}))
    cps =  cps1_in.copy(deep=True)
    
    print(cps.dtypes)
    Unnamed: 0        int64
    trt               int64
    age               int64
    educ              int64
    black             int64
    hisp              int64
    marr              int64
    no_deg            int64
    real_earn_74    float64
    real_earn_75    float64
    real_earn_78    float64
    dtype: object
    print(cps.head())
       Unnamed: 0  trt  age  educ  ...  no_deg  real_earn_74  real_earn_75  real_earn_78
    0           1    0   45    11  ...       1     21516.670     25243.550     25564.670
    1           2    0   21    14  ...       0      3175.971      5852.565     13496.080
    2           3    0   38    12  ...       0     23039.020     25130.760     25564.670
    3           4    0   48     6  ...       1     24994.370     25243.550     25564.670
    4           5    0   18     8  ...       1      1669.295     10727.610      9860.869
    
    [5 rows x 11 columns]
  2. Create an indicator variable to identify someone who had no earnings in any of the years.

    Pandas has a set of methods that can be applied to all elements of a data frame (or a series.) A few examples of these functions are; abs(), isin(), isna(), notna(), mask(), where(), eq(), ne(), ge(), gt(), le(), and lt(). We have already used several of these functions such as where() and isna(). It is the comparison methods that will be useful for these examples.

    Pandas also has a set of methods that can be applied across rows or columns of a data frame. These functions take a set of values and return a single value. A few examples of these functions are; min(), max(), mean(), sum(), all(), and any(). These methods take the axis parameter, that tells the method use ('index') row data as the parameters or ('columns') for columns to be used as the parameters. For example, all(axis='index') will return True for any columun where all the values are True.

    This example uses the eq() method to test for unemployed and the any() method to determine if this occured one or more times. The the axis parameter is set to 1 to find the minimum for each row.

    cps = (
        cps.assign(ever_unemployed = lambda df: df
            .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']]
            .eq(0)
            .any(axis='columns')))
    
    (cps
        .loc[:, ['ever_unemployed', 'age', 'real_earn_74',
                 'real_earn_75', 'real_earn_78']]
        .groupby('ever_unemployed')
        .head(3)
        .pipe(print))
        ever_unemployed  age  real_earn_74  real_earn_75  real_earn_78
    0             False   45    21516.6700     25243.550     25564.670
    1             False   21     3175.9710      5852.565     13496.080
    2             False   38    23039.0200     25130.760     25564.670
    9              True   45    25862.3200         0.000      3924.842
    17             True   46    19171.4300      1317.677         0.000
    20             True   49      391.8534         0.000         0.000

    The apply() method can be used to apply functions that take a set of values and return a single value for each row or column in a data frame. This is useful when you want to apply a function to rows or columns that is not defined by pandas.

    We will use apply() with a lamba function to provide a solution to this example. The columns (or rows) can be matched to different parameters in the function if needed. The use of apply() is a flexible approach to operating on all rows (or columns.)

    cps = (
        cps.assign(ever_unemployed = lambda df: df
            .apply(lambda row:
                (pd.Series([row['real_earn_74'], row['real_earn_75'],
                            row['real_earn_78']])
                    .eq(0)
                    .any()),
                axis='columns')
            .eq(0)))
    
    (cps
        .loc[:, ['ever_unemployed', 'age', 'real_earn_74',
                 'real_earn_75', 'real_earn_78']]
        .groupby('ever_unemployed')
        .head(3)
        .pipe(print))
        ever_unemployed  age  real_earn_74  real_earn_75  real_earn_78
    0              True   45    21516.6700     25243.550     25564.670
    1              True   21     3175.9710      5852.565     13496.080
    2              True   38    23039.0200     25130.760     25564.670
    9             False   45    25862.3200         0.000      3924.842
    17            False   46    19171.4300      1317.677         0.000
    20            False   49      391.8534         0.000         0.000

    Note that while all the parameters of the function were columns of a data frame in this example, scalar values could also be passed as parameters.

    The following alternate implementation uses the min() function to identify the minimum salary earned. If the minimum is 0, then they were unemplyed at least one year.

    cps = (
        cps.assign(ever_unemployed = lambda df: df
            .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']]
            .min(axis='columns')
            .eq(0)))
    
    (cps
        .loc[:, ['ever_unemployed', 'age', 'real_earn_74',
                 'real_earn_75', 'real_earn_78']]
        .groupby('ever_unemployed')
        .head(3)
        .pipe(print))
        ever_unemployed  age  real_earn_74  real_earn_75  real_earn_78
    0             False   45    21516.6700     25243.550     25564.670
    1             False   21     3175.9710      5852.565     13496.080
    2             False   38    23039.0200     25130.760     25564.670
    9              True   45    25862.3200         0.000      3924.842
    17             True   46    19171.4300      1317.677         0.000
    20             True   49      391.8534         0.000         0.000

    The following also uses the min() function. The min() function is done through the apply() function.

    cps = (
        cps.assign(ever_unemployed = lambda df: df
            .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']]
            .apply(lambda row: min(row) == 0, axis='columns')))
    
    (cps
        .loc[:, ['ever_unemployed', 'age', 'real_earn_74',
                 'real_earn_75', 'real_earn_78']]
        .groupby('ever_unemployed')
        .head(3)
        .pipe(print))
        ever_unemployed  age  real_earn_74  real_earn_75  real_earn_78
    0             False   45    21516.6700     25243.550     25564.670
    1             False   21     3175.9710      5852.565     13496.080
    2             False   38    23039.0200     25130.760     25564.670
    9              True   45    25862.3200         0.000      3924.842
    17             True   46    19171.4300      1317.677         0.000
    20             True   49      391.8534         0.000         0.000

    It is best to use the simplest and clearest approach to achive the needed results. Although the use of mathematical functions to achieve logical test are farily common they may not be the clearest expression of what is being done. It is best to see if a set of logical expressions can be used and are a clearer expression of what is being done.

    The more flexible approaches using apply() were shown here to demonstrate this important function with a simple examples. It is not meant to imply the use of apply() is the preferred solution for this problem.

  3. Create an indicator variable to identify someone who was never unemployed in any of the years.

    This is a check to see if the earnings for all years is greater than 0.

    cps = (
        cps.assign(never_unemployed = lambda df: df
            .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']]
            .gt(0)
            .all(axis='columns')))
    
    (cps
        .loc[:, ['never_unemployed', 'age',
                 'real_earn_74', 'real_earn_75', 'real_earn_78']]
        .groupby('never_unemployed')
        .head(3)
        .pipe(print))
        never_unemployed  age  real_earn_74  real_earn_75  real_earn_78
    0               True   45    21516.6700     25243.550     25564.670
    1               True   21     3175.9710      5852.565     13496.080
    2               True   38    23039.0200     25130.760     25564.670
    9              False   45    25862.3200         0.000      3924.842
    17             False   46    19171.4300      1317.677         0.000
    20             False   49      391.8534         0.000         0.000

    The following implementation uses the isin() method. This can be useful when there is more than one possible matching criteria. Here to achieve a test of not in, ~ is used to negate the results of the lambda function.

    cps = (
        cps.assign(never_unemployed = lambda df: ~df
            .loc[:, ['real_earn_74', 'real_earn_75', 'real_earn_78']]
            .isin([0])
            .all(axis='columns')))
    
    (cps
        .loc[:, ['never_unemployed', 'age',
                 'real_earn_74', 'real_earn_75', 'real_earn_78']]
        .groupby('never_unemployed')
        .head(3)
        .pipe(print))
         never_unemployed  age  real_earn_74  real_earn_75  real_earn_78
    0                True   45     21516.670     25243.550      25564.67
    1                True   21      3175.971      5852.565      13496.08
    2                True   38     23039.020     25130.760      25564.67
    40              False   21         0.000         0.000          0.00
    130             False   53         0.000         0.000          0.00
    137             False   30         0.000         0.000          0.00

5.7.4 Exercises

These exercises use the Chile.csv data set.

  1. Import the Chile.csv file.

  2. Find all rows with a missing value in any column using a related columns method.