SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

5.2 Character variables

These exercises use the mtcars.csv data set.

  1. Import the mtcars.csv data set.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    mtcars_path = Path('..') / 'datasets' / 'mtcars.csv'
    mtcars_in = pd.read_csv(mtcars_path)
    mtcars_in = mtcars_in.rename(columns={'Unnamed: 0': 'make_model'})
    mtcars =  mtcars_in.copy(deep=True)
    
    print(mtcars.dtypes)
    make_model     object
    mpg           float64
    cyl             int64
    disp          float64
    hp              int64
    drat          float64
    wt            float64
    qsec          float64
    vs              int64
    am              int64
    gear            int64
    carb            int64
    dtype: object
  2. Divide the column that has the car name into columns that contain the make and model of the car.

    mtcars = (
        mtcars
            .assign(
                make=lambda df: 
                    df['make_model'].str.extract('(\\S+)', expand=True),
                model=lambda df: 
                    df['make_model'].str.extract('\\s+(.+)', expand=True)))
    
    (mtcars
        .loc[:, ['make_model', 'make', 'model']]
        .head(10)
        .pipe(print))
              make_model     make       model
    0          Mazda RX4    Mazda         RX4
    1      Mazda RX4 Wag    Mazda     RX4 Wag
    2         Datsun 710   Datsun         710
    3     Hornet 4 Drive   Hornet     4 Drive
    4  Hornet Sportabout   Hornet  Sportabout
    5            Valiant  Valiant         NaN
    6         Duster 360   Duster         360
    7          Merc 240D     Merc        240D
    8           Merc 230     Merc         230
    9           Merc 280     Merc         280
  3. Do all observations have a make and model value? If there are missing values, can you fix them? (Hint, use Google to help you.)

    (mtcars
        .query('make != make | model != model')
        .head()
        .pipe(print))
      make_model   mpg  cyl   disp   hp  drat  ...  vs  am  gear  carb     make  model
    5    Valiant  18.1    6  225.0  105  2.76  ...   1   0     3     1  Valiant    NaN
    
    [1 rows x 14 columns]

    There is a missing model name for the make of Valiant

    Googling shows that the Valiant was produced by Plymouth. We can correct the Valiant observation(s) to include Plymouth.

    mtcars =  mtcars_in
    mtcars = (
        mtcars
            .replace({'make_model': {'Valiant': 'Plymouth Valiant'}})
            .assign(
                make=lambda df: 
                    df['make_model'].str.extract(r'(\S+)', expand=True),
                model=lambda df: 
                    df['make_model'].str.extract(r'\s+(.+)', expand=True)))
    
    (mtcars
        .loc[:, ['make_model', 'make', 'model']]
        .head(10)
        .pipe(print))
              make_model      make       model
    0          Mazda RX4     Mazda         RX4
    1      Mazda RX4 Wag     Mazda     RX4 Wag
    2         Datsun 710    Datsun         710
    3     Hornet 4 Drive    Hornet     4 Drive
    4  Hornet Sportabout    Hornet  Sportabout
    5   Plymouth Valiant  Plymouth     Valiant
    6         Duster 360    Duster         360
    7          Merc 240D      Merc        240D
    8           Merc 230      Merc         230
    9           Merc 280      Merc         280

    Note, raw strings are used in the str.extract() methods to avoid the double backslash.

  4. Some car companies have more than one make. In this data Chrysler, Plymouth, and Dodge were all made by Chrysler. Likewise Cadillac and Pontiac are made by GM and Lincoln and Ford are both made by Ford. Create a company variable based on the data in the make variable

    mtcars = (
        mtcars
            .assign(
                company=lambda df: df['make'])
            .replace({
                'company': {
                    'Plymouth': 'Chrysler',
                    'Dodge': 'Chrysler',
                    'Lincoln': 'Ford',
                    'Cadillac': 'GM',
                    'Pontiac': 'GM'}}))
    (mtcars
        .loc[:, ['company', 'make', 'model']]
        .head(15)
        .pipe(print))
         company      make       model
    0      Mazda     Mazda         RX4
    1      Mazda     Mazda     RX4 Wag
    2     Datsun    Datsun         710
    3     Hornet    Hornet     4 Drive
    4     Hornet    Hornet  Sportabout
    5   Chrysler  Plymouth     Valiant
    6     Duster    Duster         360
    7       Merc      Merc        240D
    8       Merc      Merc         230
    9       Merc      Merc         280
    10      Merc      Merc        280C
    11      Merc      Merc       450SE
    12      Merc      Merc       450SL
    13      Merc      Merc      450SLC
    14        GM  Cadillac   Fleetwood

    Putting together all the code to create the company, make, and model would provide the following.

    mtcars = mtcars_in
    mtcars = (
        mtcars
            .replace({
                'make_model': {'Valiant': 'Plymouth Valiant'}})
            .assign(
                make=lambda df: 
                    df['make_model'].str.extract('(\\S+)', expand=False),
                model=lambda df: 
                    df['make_model'].str.extract('\\s+(.+)', expand=True),
                company=lambda df: df.make)
            .replace({
                'company': {
                    'Plymouth': 'Chrysler',
                    'Dodge': 'Chrysler',
                    'Lincoln': 'Ford',
                    'Cadillac': 'GM',
                    'Pontiac': 'GM'}}))
    
    (mtcars
        .loc[:, ['company', 'make', 'model']]
        .head(10)
        .pipe(print))
        company      make       model
    0     Mazda     Mazda         RX4
    1     Mazda     Mazda     RX4 Wag
    2    Datsun    Datsun         710
    3    Hornet    Hornet     4 Drive
    4    Hornet    Hornet  Sportabout
    5  Chrysler  Plymouth     Valiant
    6    Duster    Duster         360
    7      Merc      Merc        240D
    8      Merc      Merc         230
    9      Merc      Merc         280
  5. Create a name for use in displaying results that is a character string composed of make, a space character, if the company name is not the same as the make then the company in parentheses (), and model.

    mtcars = (
        mtcars
            .assign(
                comp_parn=lambda df: 
                    np.where(df['company'] != df['make'], ' (' + df['company'] + ') ', ''),
                name=lambda df:
                    df['make'] + df['comp_parn'] + df['model']))
    
    (mtcars
        .loc[:, ['name', 'make', 'comp_parn', 'model']]
        .head(15)
        .pipe(print))
                               name      make     comp_parn       model
    0                      MazdaRX4     Mazda                       RX4
    1                  MazdaRX4 Wag     Mazda                   RX4 Wag
    2                     Datsun710    Datsun                       710
    3                 Hornet4 Drive    Hornet                   4 Drive
    4              HornetSportabout    Hornet                Sportabout
    5   Plymouth (Chrysler) Valiant  Plymouth   (Chrysler)      Valiant
    6                     Duster360    Duster                       360
    7                      Merc240D      Merc                      240D
    8                       Merc230      Merc                       230
    9                       Merc280      Merc                       280
    10                     Merc280C      Merc                      280C
    11                    Merc450SE      Merc                     450SE
    12                    Merc450SL      Merc                     450SL
    13                   Merc450SLC      Merc                    450SLC
    14      Cadillac (GM) Fleetwood  Cadillac         (GM)    Fleetwood

    or

    mtcars = (
        mtcars
            .assign(
                comp_parn=lambda df: 
                    [' (' + comp + ') ' if comp != make else ''
                    for comp, make in zip(df['company'], df['make'])],
                name=lambda df:
                    df['make'] + df['comp_parn'] + df['model']))
    
    (mtcars
        .loc[:, ['name', 'make', 'comp_parn', 'model']]
        .head(15)
        .pipe(print))
                               name      make     comp_parn       model
    0                      MazdaRX4     Mazda                       RX4
    1                  MazdaRX4 Wag     Mazda                   RX4 Wag
    2                     Datsun710    Datsun                       710
    3                 Hornet4 Drive    Hornet                   4 Drive
    4              HornetSportabout    Hornet                Sportabout
    5   Plymouth (Chrysler) Valiant  Plymouth   (Chrysler)      Valiant
    6                     Duster360    Duster                       360
    7                      Merc240D      Merc                      240D
    8                       Merc230      Merc                       230
    9                       Merc280      Merc                       280
    10                     Merc280C      Merc                      280C
    11                    Merc450SE      Merc                     450SE
    12                    Merc450SL      Merc                     450SL
    13                   Merc450SLC      Merc                    450SLC
    14      Cadillac (GM) Fleetwood  Cadillac         (GM)    Fleetwood