4.8 Coding missing values - part 2 | Data Wrangling Essentials

SSCC - Social Science Computing Cooperative

Supporting Statistical Analysis for Research

4.8 Coding missing values - part 2

These exercises use the PSID.csv data set that was imported in the prior section.

Import the PSID.csv data set.

from pathlib import Path
import pandas as pd
import numpy as np

psid_path = Path('..') / 'datasets' / 'PSID.csv'
psid_in = pd.read_csv(psid_path)
psid_in = (
    psid_in
        .rename( columns={
            'Unnamed: 0': 'obs_num',
            'intnum': 'intvw_num', 
            'persnum': 'person_id',
            'married': 'marital_status'}))
psid = psid_in.copy(deep=True)
psid = psid.drop(columns='obs_num')

print(psid.dtypes)

intvw_num           int64
person_id           int64
age                 int64
educatn           float64
earnings            int64
hours               int64
kids                int64
marital_status     object
dtype: object

Code NA for the NA/DF and no histories values for the marriage status variable.

This first solution uses repeated applications of mask() to change the multiple missing indicators to np.NaN

psid = (
    psid
        .assign(
            marital_status=(lambda df: df
                .marital_status
                .mask(df['marital_status'] == 'NA/DF', np.NaN)
                .mask(df['marital_status'] == 'no histories', np.NaN))))

(psid
    .query('marital_status != marital_status')
    .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']]
    .sort_values(by=['person_id', 'age'])
    .head(n=15)
    .pipe(print))

      intvw_num  person_id  age  educatn  kids marital_status
1831       2704          2   39      0.0    99            NaN
2797       5806          2   45      0.0    99            NaN
3563       6583          2   46     12.0     0            NaN
4467       8444          2   48      8.0    99            NaN
1076       1709          2   50      0.0    99            NaN
1747       2614          3   37      0.0    99            NaN
1843       2714          3   38     12.0     1            NaN
1665       2508          3   45     17.0     0            NaN
3643       6655          4   30     11.0     2            NaN
2840       5834          4   46     14.0     0            NaN
3405       6408          4   46     12.0    99            NaN
357         633          5   34     99.0    99            NaN
2971       5942          9   31     14.0     2            NaN
749        1184         21   49      0.0    99            NaN
1215       1906        170   41     17.0    99            NaN

Or, this soultion which uses the .isin() method to identify the observations that need to be set to np.NaN.

psid = (
    psid
        .assign(
            marital_status=(lambda df: df
                .marital_status
                .mask(df['marital_status'].isin(['NA/DF']), np.NaN))))

Or, using replace(). The replace() method is a good option when the conditions being tested are equalities.

psid = psid.replace(
    {'marital_status': {'NA/DF': np.NaN, 'no histories': np.NaN}})

Note, this last approach can not be chained without the inplace parameter being set to false.

Change the units on the earnings and hours variables to be thousands of dollars or hours. Use a method that operates on multiple columns.

Hint, to do the unit change on a variable x, one would do x / 1000.

psid = (
    psid
        .apply(
            func=lambda x: x / 1000
            if x.name in ['earnings', 'hours'] else x))

(psid
    .loc[:, ['intvw_num', 'person_id', 'age', 'educatn', 'kids', 'marital_status']]
    .head(n=15)
    .pipe(print))

    intvw_num  person_id  age  educatn  kids marital_status
0           4          4   39     12.0     2        married
1           4          6   35     12.0     2       divorced
2           4          7   33     12.0     1        married
3           4        173   39     10.0     2        married
4           5          2   47      9.0     5        married
5           6          4   44     12.0     2        married
6           6        172   38     16.0     3        married
7           7          4   38      9.0     4       divorced
8           7        170   39     12.0     3        married
9           7        171   37     11.0     5        married
10         10          3   48     13.0    98       divorced
11         10        171   47     12.0     3        married
12         10        178   40     12.0     0      separated
13         11        171   38     16.0     0        married
14         13          3   41     12.0     2        married