SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.8 Coding missing values - part 2

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
            .rename( columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num', 
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)
    psid = psid.drop(columns='obs_num')
    
    print(psid.dtypes)
    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object
  2. Code NA for the NA/DF and no histories values for the marriage status variable.

    This first solution uses repeated applications of mask() to change the multiple missing indicators to np.NaN

    psid = (
        psid
            .assign(
                marital_status=(lambda df: df
                    .marital_status
                    .mask(df['marital_status'] == 'NA/DF', np.NaN)
                    .mask(df['marital_status'] == 'no histories', np.NaN))))
    
    (psid
        .query('marital_status != marital_status')
        .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']]
        .sort_values(by=['person_id', 'age'])
        .head(n=15)
        .pipe(print))
          intvw_num  person_id  age  educatn  kids marital_status
    1831       2704          2   39      0.0    99            NaN
    2797       5806          2   45      0.0    99            NaN
    3563       6583          2   46     12.0     0            NaN
    4467       8444          2   48      8.0    99            NaN
    1076       1709          2   50      0.0    99            NaN
    1747       2614          3   37      0.0    99            NaN
    1843       2714          3   38     12.0     1            NaN
    1665       2508          3   45     17.0     0            NaN
    3643       6655          4   30     11.0     2            NaN
    2840       5834          4   46     14.0     0            NaN
    3405       6408          4   46     12.0    99            NaN
    357         633          5   34     99.0    99            NaN
    2971       5942          9   31     14.0     2            NaN
    749        1184         21   49      0.0    99            NaN
    1215       1906        170   41     17.0    99            NaN

    Or, this soultion which uses the .isin() method to identify the observations that need to be set to np.NaN.

    psid = (
        psid
            .assign(
                marital_status=(lambda df: df
                    .marital_status
                    .mask(df['marital_status'].isin(['NA/DF']), np.NaN))))

    Or, using replace(). The replace() method is a good option when the conditions being tested are equalities.

    psid = psid.replace(
        {'marital_status': {'NA/DF': np.NaN, 'no histories': np.NaN}})

    Note, this last approach can not be chained without the inplace parameter being set to false.

  3. Change the units on the earnings and hours variables to be thousands of dollars or hours. Use a method that operates on multiple columns.

    Hint, to do the unit change on a variable x, one would do x / 1000.

    psid = (
        psid
            .apply(
                func=lambda x: x / 1000
                if x.name in ['earnings', 'hours'] else x))
    
    (psid
        .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']]
        .head(n=15)
        .pipe(print))
        intvw_num  person_id  age  educatn  kids marital_status
    0           4          4   39     12.0     2        married
    1           4          6   35     12.0     2       divorced
    2           4          7   33     12.0     1        married
    3           4        173   39     10.0     2        married
    4           5          2   47      9.0     5        married
    5           6          4   44     12.0     2        married
    6           6        172   38     16.0     3        married
    7           7          4   38      9.0     4       divorced
    8           7        170   39     12.0     3        married
    9           7        171   37     11.0     5        married
    10         10          3   48     13.0    98       divorced
    11         10        171   47     12.0     3        married
    12         10        178   40     12.0     0      separated
    13         11        171   38     16.0     0        married
    14         13          3   41     12.0     2        married