SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.6 Subsets of a data frame

  1. Import the PSID.csv data set that was imported in the prior section.

    from pathlib import Path
    import pandas as pd
    psid_path = Path('..') / 'datasets' / 'PSID.csv'
    psid_in = pd.read_csv(psid_path)
    psid_in = (
        psid_in
            .rename(columns={
                'Unnamed: 0': 'obs_num',
                'intnum': 'intvw_num',
                'persnum': 'person_id',
                'married': 'marital_status'}))
    psid = psid_in.copy(deep=True)
    
    print(psid.dtypes)
    obs_num             int64
    intvw_num           int64
    person_id           int64
    age                 int64
    educatn           float64
    earnings            int64
    hours               int64
    kids                int64
    marital_status     object
    dtype: object

    The obs_num variable is retained for these examples. The examples in this section operate on row numbers, and this variable holds the row number of each observation, counting from 1 (the positions used by iloc count from 0).

  2. Display the last three rows of the data frame using positional values to subset.

    (psid
        .iloc[-3:, :]
        .pipe(print))
          obs_num  intvw_num  person_id  age  ...  earnings  hours  kids  marital_status
    4853     4854       9302          1   37  ...     22045   2793    98        divorced
    4854     4855       9305          2   40  ...       134     30     3         married
    4855     4856       9306          2   37  ...     33000   2423     4         married
    
    [3 rows x 9 columns]

    Use the tail() method to confirm that the same three rows are displayed.

    (psid
        .tail(3)
        .pipe(print))
          obs_num  intvw_num  person_id  age  ...  earnings  hours  kids  marital_status
    4853     4854       9302          1   37  ...     22045   2793    98        divorced
    4854     4855       9305          2   40  ...       134     30     3         married
    4855     4856       9306          2   37  ...     33000   2423     4         married
    
    [3 rows x 9 columns]
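
The match between the two approaches can also be checked in code rather than by eye. The following is a minimal sketch on a small made-up data frame (the `toy` frame and its values are illustrative assumptions, not PSID data). It compares the negative positional slice with `tail()` using `DataFrame.equals()`.

```python
import pandas as pd

# Hypothetical stand-in for psid; values are invented for illustration.
toy = pd.DataFrame({
    'obs_num': [1, 2, 3, 4, 5, 6],
    'age': [39, 47, 33, 26, 47, 38],
    'kids': [2, 2, 1, 0, 5, 3],
})

# A negative start counts back from the end: -3: means "the last three rows".
last_by_position = toy.iloc[-3:, :]
last_by_tail = toy.tail(3)

# equals() reports whether the two frames have the same shape and elements.
same_rows = last_by_position.equals(last_by_tail)
print(same_rows)
```
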

  3. Display the first, third, fifth, and seventh rows of columns two and three.

    (psid
        .iloc[[0, 2, 4, 6], [1, 2]]
        .pipe(print))
       intvw_num  person_id
    0          4          4
    2          4          7
    4          5          2
    6          6        172
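
iloc accepts a list of positions for the columns as well as for the rows. The following is a minimal sketch on a made-up data frame (the `toy` frame and its values are assumptions for illustration). Positions count from zero, so columns two and three are positions 1 and 2, and row position 0 holds obs_num 1.

```python
import pandas as pd

# Hypothetical stand-in for psid; values are invented for illustration.
toy = pd.DataFrame({
    'obs_num': [1, 2, 3, 4, 5, 6, 7],
    'intvw_num': [10, 10, 11, 11, 12, 12, 13],
    'person_id': [1, 2, 1, 2, 1, 2, 1],
    'age': [39, 47, 33, 26, 47, 38, 41],
})

# Rows 1, 3, 5, 7 are positions 0, 2, 4, 6; columns 2 and 3 are positions 1 and 2.
subset = toy.iloc[[0, 2, 4, 6], [1, 2]]
print(subset)

# The same rows with every column kept, to show the one-based obs_num values.
picked_obs = toy.iloc[[0, 2, 4, 6], :]['obs_num'].tolist()
print(picked_obs)
```
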

  4. Create a smaller data frame using the first 20 rows.

    psid_small = psid.iloc[:20, :]
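
Positional slices in iloc start at zero and exclude the stop value, so the start of the slice determines how many rows come back. The following is a minimal sketch on a made-up data frame (`toy` and its values are illustrative assumptions). It compares `iloc[:20]` with `iloc[1:20]` and checks the first against `head()`.

```python
import pandas as pd

# Hypothetical stand-in for psid: 30 rows with a one-based obs_num column.
toy = pd.DataFrame({'obs_num': range(1, 31)})

first_20 = toy.iloc[:20, :]      # positions 0 through 19
skips_first = toy.iloc[1:20, :]  # positions 1 through 19

print(len(first_20))     # 20
print(len(skips_first))  # 19

# head(20) returns the same first 20 rows.
matches_head = first_20.equals(toy.head(20))
print(matches_head)
```
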