SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

6.2 Tidy data

These examples use the Car.csv data set.

  1. Load the Car.csv data set.

    from pathlib import Path
    import os
    import numpy as np
    import pandas as pd
    car_path = Path('..') / 'datasets' / 'Car.csv'
    car_in = pd.read_csv(car_path)
    car_in = (
        car_in
            .rename(columns={
                'Unnamed: 0': 'id'}))
    
    car =  car_in.copy(deep=True)
    
    print(car.dtypes)
    id              int64
    choice         object
    college         int64
    hsg2            int64
    coml5           int64
    type1          object
    type2          object
    type3          object
    type4          object
    type5          object
    type6          object
    fuel1          object
    fuel2          object
    fuel3          object
    fuel4          object
    fuel5          object
    fuel6          object
    price1        float64
    price2        float64
    price3        float64
    price4        float64
    price5        float64
    price6        float64
    range1          int64
    range2          int64
    range3          int64
    range4          int64
    range5          int64
    range6          int64
    acc1          float64
                   ...   
    pollution1    float64
    pollution2    float64
    pollution3    float64
    pollution4    float64
    pollution5    float64
    pollution6    float64
    size1           int64
    size2           int64
    size3           int64
    size4           int64
    size5           int64
    size6           int64
    space1        float64
    space2        float64
    space3        float64
    space4        float64
    space5          int64
    space6          int64
    cost1           int64
    cost2           int64
    cost3           int64
    cost4           int64
    cost5           int64
    cost6           int64
    station1      float64
    station2      float64
    station3      float64
    station4      float64
    station5      float64
    station6      float64
    Length: 71, dtype: object
    (car
       .head()
       .pipe(print))
       id   choice  college  hsg2  ...  station3 station4 station5 station6
    0   1  choice1        0     0  ...       0.3      0.3      1.0      1.0
    1   2  choice2        1     1  ...       0.1      0.1      1.0      1.0
    2   3  choice5        0     1  ...       0.3      0.3      1.0      1.0
    3   4  choice5        0     0  ...       0.7      0.7      0.1      0.1
    4   5  choice5        0     1  ...       0.7      0.7      1.0      1.0
    
    [5 rows x 71 columns]
  2. Create a new data frame that uses only the variables for vehicle chosen, demographics, and body type options. Tidy this new data frame.

    gather_vars = ['type1', 'type2', 'type3', 'type4', 'type5', 'type6']
    non_gather_vars = ['id', 'choice', 'college', 'hsg2', 'coml5']
    car_type = (
        car
            .loc[:, 'id':'type6']
            .melt(
                id_vars=non_gather_vars,
                value_vars=gather_vars,
                var_name='type_order',
                value_name='type')
            .assign(
                order=lambda df: 
                    df['type_order'].str.extract('type([\\d]+)', expand=True))
            .drop(columns='type_order')
            .sort_values(by=['id', 'order'])
            )
    
    print(car_type.dtypes)
    id          int64
    choice     object
    college     int64
    hsg2        int64
    coml5       int64
    type       object
    order      object
    dtype: object
    (car_type
       .loc[:, ['id', 'choice', 'order',  'college', 'hsg2', 'coml5', 'type']]
       .head(9)
       .pipe(print))
           id   choice order  college  hsg2  coml5     type
    0       1  choice1     1        0     0      0      van
    4654    1  choice1     2        0     0      0   regcar
    9308    1  choice1     3        0     0      0      van
    13962   1  choice1     4        0     0      0  stwagon
    18616   1  choice1     5        0     0      0      van
    23270   1  choice1     6        0     0      0    truck
    1       2  choice2     1        1     1      1   regcar
    4655    2  choice2     2        1     1      1      van
    9309    2  choice2     3        1     1      1   regcar
  3. Using the data set with all the variables on the vehicle options, Tidy the data set.

    To gather multiple variables from multiple sets of columns, one gathers on all columns from all the sets. Then modify the variable created as the key to contain the names of the new variables. Finish by spreading using the column that contains the variable names. This is a common set of steps used when shapping a data frame. That is make the data frame as tall enough to contain everything that will be spread. Then spread to the desired columns.

    non_gather_vars = ['id', 'choice', 'college', 'hsg2', 'coml5']
    non_pivot_vars = non_gather_vars.copy()
    non_pivot_vars.append('order')
    car_tidy = (
        car
            .melt(
                id_vars=non_gather_vars,
                var_name='variable_order',
                value_name='value')
            .assign(
                order=lambda df: 
                    df['variable_order'].str.extract('([\\d]+)', expand=True),
                variable=lambda df: 
                    df['variable_order'].str.extract('([^\\d]+)', expand=True))
            .drop(columns='variable_order')
            .pivot_table(
                index=non_pivot_vars,
                columns='variable',
                values='value',
                aggfunc='first')
            .reset_index()
            .rename_axis(None, axis='index')
            .set_index('order')
            .reset_index()
            .sort_values(by=['id', 'order'])
            )
    
    print(car_tidy.shape)
    (27924, 17)
    (car_tidy
       .head(9)
       .pipe(print))
    variable order  id   choice  college  hsg2  ...  size space speed station     type
    0            1   1  choice1        0     0  ...     3   0.7    95     0.1      van
    1            2   1  choice1        0     0  ...     3   0.7    95     0.1   regcar
    2            3   1  choice1        0     0  ...     2     1   110     0.3      van
    3            4   1  choice1        0     0  ...     2     1   110     0.3  stwagon
    4            5   1  choice1        0     0  ...     3     1   140       1      van
    5            6   1  choice1        0     0  ...     3     1   140       1    truck
    6            1   2  choice2        1     1  ...     3   0.7    85       0   regcar
    7            2   2  choice2        1     1  ...     3   0.7    85       0      van
    8            3   2  choice2        1     1  ...     3     1   140     0.1   regcar
    
    [9 rows x 17 columns]