SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

2.3 More challenging csv and deliminated files

  1. Import the amis.csv data set.

    Set RStudio to python mode

    library(reticulate)
    repl_python()
    from pathlib import Path
    import pandas as pd
    amis_path = Path('..') / 'datasets' / 'amis.csv'
    amis = pd.read_csv(amis_path)

    The above read_csv() will cause an error. We will need to look at part of the data frame to see what we need to change to import the data set.

    amis = (
        pd.read_csv(
            amis_path,
            sep='^',
            nrows=25))
    print(amis.head())
                                                   speed
    0               Speeds of cars (in miles per hour). 
    1                                             period
    2    A numeric column indicating the time that th...
    3                                            warning
    4    A numeric column indicating whether the loca...
  2. Are there any rows that need to be ignored in the amis data set? If so, modify your import to account for them.

    print(amis.head(15))
                                                    speed
    0                Speeds of cars (in miles per hour). 
    1                                              period
    2     A numeric column indicating the time that th...
    3                                             warning
    4     A numeric column indicating whether the loca...
    5                                                pair
    6     A numeric column giving the pair number at w...
    7                                              Source
    8     The data were kindly made available by Mr. G...
    9                  ,"speed","period","warning","pair"
    10                                         1,26,1,1,1
    11                                         2,26,1,1,1
    12                                         3,26,1,1,1
    13                                         4,26,1,1,1
    14                                         5,27,1,1,1

    There are 10 rows that do not contain either data or column names in the file. We will import the csv file skipping these rows.

    amis = pd.read_csv(amis_path, skiprows=10)
  3. Are there any special symbols that need to be set to missing in the amis data set? If so, modify your import to account for them.

    print(amis.head(15))
        Unnamed: 0  speed  period  warning  pair
    0            1     26       1        1     1
    1            2     26       1        1     1
    2            3     26       1        1     1
    3            4     26       1        1     1
    4            5     27       1        1     1
    5            6     28       1        1     1
    6            7     28       1        1     1
    7            8     28       1        1     1
    8            9     28       1        1     1
    9           10     29       1        1     1
    10          11     29       1        1     1
    11          12     29       1        1     1
    12          13     29       1        1     1
    13          14     29       1        1     1
    14          15     29       1        1     1

    There does not appear to be any missing identifiers in the data set.

  4. Import the mifem.csv data set.

    mifem_path = Path('..') / 'datasets' / 'mifem.csv'
    mifem = pd.read_csv(mifem_path)
  5. Is there any meta data at the top or bottom of the mifem data set? You will need to determine how to view the bottom of a data set. If so, modify your import to account for them.

    print(mifem.head())
       Unnamed: 0 outcome  age  yronset premi  ... diabetes highbp hichol angina stroke
    0           1    live   63       85     n  ...        n      y      y      n      n
    1           6    live   55       85     n  ...        n      y      y      n      n
    2           8    live   68       85     y  ...       nk      y     nk      y      n
    3          10    live   64       85     n  ...        n      y      n      y      n
    4          11    dead   67       85     n  ...       nk     nk     nk     nk     nk
    
    [5 rows x 11 columns]
    print(mifem.tail())
          Unnamed: 0 outcome  age  yronset  ... highbp hichol angina stroke
    1290        6347    live   69       93  ...      y     nk      n      y
    1291        6359    live   54       93  ...      y      y      n      n
    1292        6360    live   64       93  ...      y      y      n      n
    1293        6361    live   36       93  ...      n      y      n      n
    1294        6366    live   65       93  ...      n     nk      n      n
    
    [5 rows x 11 columns]

    There does not appear to be any meta data at the top or bottom of the data frame.

  6. Are there any special symbols that need to be set to missing in the mifem data set? If so, modify your import to account for them.

    print(mifem.head().to_string())
       Unnamed: 0 outcome  age  yronset premi smstat diabetes highbp hichol angina stroke
    0           1    live   63       85     n      x        n      y      y      n      n
    1           6    live   55       85     n      c        n      y      y      n      n
    2           8    live   68       85     y     nk       nk      y     nk      y      n
    3          10    live   64       85     n      x        n      y      n      y      n
    4          11    dead   67       85     n     nk       nk     nk     nk     nk     nk

    There are values of nk in several of the variables. The data descriptions do not identify any other indicator for missing.

    mifem = pd.read_csv(mifem_path, na_values=['', 'nk'])
    
    print(mifem.head().to_string())
       Unnamed: 0 outcome  age  yronset premi smstat diabetes highbp hichol angina stroke
    0           1    live   63       85     n      x        n      y      y      n      n
    1           6    live   55       85     n      c        n      y      y      n      n
    2           8    live   68       85     y    NaN      NaN      y    NaN      y      n
    3          10    live   64       85     n      x        n      y      n      y      n
    4          11    dead   67       85     n    NaN      NaN    NaN    NaN    NaN    NaN