SSCC - Social Science Computing Cooperative Supporting Statistical Analysis for Research

4.3 Copying data sets

4.3.1 Data concepts - Copies of the data

When cleaning and wrangling data, it can be helpful to retain the original imported data frame without changes. This allows you the opportunity to compare your changes to the original data frame to check your work as you clean the data.

When the code to clean a data set takes a meaningful amount of time to run, the cleaned data set can be written to a file. This cleaned data set can then be used for analysis or exploratory work without rerunning the code to clean the data. When this is done, the cleaning code is kept in its own script, separate from the other wrangling, exploratory, and analysis code. When the cleaning time is short, the cleaning code can be run every time the analysis is worked on. There may be no need for a saved clean data file.

When a cleaned data set is saved to a file, it should be given a new name and possibly stored in a different directory. It is a best practice to not overwrite the original data files.
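The naming practice above can be sketched in a few lines of Python. This is a minimal illustration and not part of the course code: the helper name cleaned_path, the _clean suffix, the datasets_clean directory, and the two-line stand-in file are all assumptions made for the example.

```python
from pathlib import Path
import tempfile


def cleaned_path(original: Path, out_dir: Path, suffix: str = "_clean") -> Path:
    """Build a path for the cleaned file that differs from the original path."""
    target = out_dir / f"{original.stem}{suffix}{original.suffix}"
    if target.resolve() == original.resolve():
        raise ValueError("refusing to overwrite the original data file")
    return target


# Work in a temporary directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "datasets"
    clean_dir = Path(tmp) / "datasets_clean"   # hypothetical directory name
    raw_dir.mkdir()
    clean_dir.mkdir()

    raw_text = "Date,Dead\n1908-09-17,1\n"     # stand-in for the real file
    original = raw_dir / "airAccs.csv"
    original.write_text(raw_text)

    target = cleaned_path(original, clean_dir)
    target.write_text(raw_text.lower())        # stand-in for real cleaning

    target_name = target.name
    original_untouched = original.read_text() == raw_text
```

The cleaned file gets both a new name and a new directory, so the original file is never a candidate for being overwritten.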

4.3.2 Programming skills

4.3.2.1 Assignment copy versus reference

The assignment command section explained that an assignment gives a name to the object that is defined on the right-hand side. In this section we expand on this and consider the case where the right-hand side is an object that already has a name. For example, say the data frame returned by read_csv() is given the name df_in. Then we assign df_in to the name df. What objects are df and df_in referencing?

In Python, df = df_in would result in df and df_in referencing the same object. Changes made to the object referenced by df could be seen by displaying df_in and vice versa. Pandas provides a method to create a copy of an object when this is needed.
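The Python behavior described above can be checked with a small sketch. The two-column data frame here is made up for illustration and is not the course data set.

```python
import pandas as pd

# A tiny stand-in data frame (illustrative values only).
df_in = pd.DataFrame({"dead": [1, 5], "aboard": [2, 5]})

df = df_in                # a second name for the same object
df_copy = df_in.copy()    # an independent copy made with the pandas copy() method

df["ground"] = 0          # add a column through the name df

same_object = df is df_in              # both names reference one object
in_columns = list(df_in.columns)       # the new column is visible through df_in
copy_columns = list(df_copy.columns)   # the copy is unaffected
```

The is operator tests whether two names reference the same object, which is a quick way to tell a reference from a copy.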

In R, df <- df_in would result in df and df_in referencing different objects that have the same values and attributes. Changes made to the object referenced by df would apply to only the df object and would not change the df_in object.

4.3.2.2 Parameter copy versus reference

The object and assignment command sections explained that names are references to an object. The functions and their parameters section explained how objects are associated with parameters to functions (and methods) and that functions (and methods) can return an object. In this section we dig a little deeper into how parameter objects are passed to functions and methods.

Python passes the name reference and not the object itself for each parameter. Therefore, the code of methods and functions uses the original objects passed in and can modify these objects. The result is that objects can be modified without using the assignment command. This means you can write

<object>.<method>()

instead of

<object> = <object>.<method>()

to change a value or attribute of <object>. Not all methods do this. You will need to check the method documentation to determine if the object is modified by its use.
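A short sketch using built-in Python types shows both kinds of methods. The values are made up for illustration.

```python
# list.sort() modifies the list in place and returns None.
scores = [3, 1, 2]
result = scores.sort()

# str.upper() returns a new string and leaves the original unchanged,
# so an assignment is needed to keep the result.
name = "wright flyer"
name.upper()               # the returned string is discarded here
upper_name = name.upper()  # the returned string is kept here

# The built-in sorted() function returns a new list instead of
# modifying its argument.
original = [3, 1, 2]
ordered = sorted(original)
```

Writing scores = scores.sort() would replace the list with None, which is one reason to check the documentation before choosing between the two forms.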

R makes copies of parameters for use by the code of functions and methods. The copy exists inside the function or method, and the original object remains unchanged where the function was called. Modification of the values or attributes of a parameter object by the code of a function only changes the local copy of the object. This requires the use of the assignment operator to change an object or its attributes.

Both of these approaches to parameters have advantages and disadvantages. As such, one approach is no better than the other. It is one of a few programming style differences between R and Python.
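The two approaches can be placed side by side in Python. In this sketch the function names and the checked key are hypothetical, and the second function only imitates the copy behavior described for R by making an explicit copy of its parameter.

```python
import copy


def add_flag_in_place(record):
    """Modify the caller's object directly (the Python default)."""
    record["checked"] = True


def add_flag_to_copy(record):
    """Work on a local copy and return it, imitating pass a copy."""
    local = copy.deepcopy(record)
    local["checked"] = True
    return local


row = {"dead": 1, "aboard": 2}
add_flag_in_place(row)            # row is changed without an assignment

row2 = {"dead": 5, "aboard": 5}
row2_checked = add_flag_to_copy(row2)   # an assignment is needed to keep the change
```

After this runs, row contains the checked key, row2 does not, and row2_checked holds the modified copy.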

The programming concept of scope was introduced in the ggplot-layers section. It was explained that data and aesthetics defined in the ggplot() function could be used by all geom_*() functions associated with it (global scope), but data and aesthetics defined in a geom_*() are local to that geom_*(). This is similar to the pass a name reference versus pass a copy approaches. The pass a name reference approach is similar to global scope in that there is one and only one copy and other parts of the program are using the same copy. The pass a copy approach is similar to local scope in that a function uses a copy that is separate from the use of the name in other parts of the program.

4.3.3 Examples - R

These examples use the airAccs.csv data set.

  1. We begin by loading the tidyverse, importing the csv file, and renaming the variables.

    library(tidyverse)
    airAccs_path <- file.path("..", "datasets", "airAccs.csv")
    air_accidents_in <- read_csv(airAccs_path, col_types = cols())
    Warning: Missing column names filled in: 'X1' [1]
    air_accidents_in <-
      rename(
        air_accidents_in,
        obs_num = X1,
        date = Date,
        plane_type = planeType,
        dead = Dead,
        aboard = Aboard,
        ground = Ground
        )
    
    glimpse(air_accidents_in)
    Observations: 5,666
    Variables: 8
    $ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

    Notice that the data frame was imported with _in appended to the data frame name used in the prior section.

  2. Make a copy of the data frame.

    air_accidents <- air_accidents_in

    An assignment is used to copy the data frame to a name without the _in. This coding practice preserves the original data set and does not modify it. The original data set is then available for comparison as a reference while cleaning the data. This practice may not be possible if the data set is very large.

  3. The following code demonstrates that an object is copied when assigning an object to a new name.

    air_accidents_reference <- air_accidents_in
    air_accidents_reference <- rename(air_accidents_reference, changed_date_name = date)
    
    glimpse(air_accidents_reference)
    Observations: 5,666
    Variables: 8
    $ obs_num           <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
    $ changed_date_name <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09...
    $ location          <chr> "Fort Myer, Virginia", "Atlantic City, New J...
    $ operator          <chr> "Military - U.S. Army", "Military - U.S. Nav...
    $ plane_type        <chr> "Wright Flyer III", "Dirigible", "Curtiss se...
    $ dead              <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20,...
    $ aboard            <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20,...
    $ ground            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

    The following code displays the data frame as it was imported. From this display, you can see that the change to the column name was made only in the copy and the as-imported data frame remains unchanged.

    glimpse(air_accidents_in)
    Observations: 5,666
    Variables: 8
    $ obs_num    <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
    $ date       <date> 1908-09-17, 1912-07-12, 1913-08-06, 1913-09-09, 19...
    $ location   <chr> "Fort Myer, Virginia", "Atlantic City, New Jersey",...
    $ operator   <chr> "Military - U.S. Army", "Military - U.S. Navy", "Pr...
    $ plane_type <chr> "Wright Flyer III", "Dirigible", "Curtiss seaplane"...
    $ dead       <dbl> 1, 5, 1, 14, 30, 21, 19, 20, 22, 19, 27, 20, 20, 23...
    $ aboard     <dbl> 2, 5, 1, 20, 30, 41, 19, 20, 22, 19, 28, 20, 20, 23...
    $ ground     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

  4. The data set (with the new names) is output to a csv file. This is done for demonstration purposes. There is no need to save this mostly unmodified data set. This step of writing a data frame to a file would typically only be done if you have changes that require either a lot of time or a lot of code to run.

    temp_data_path <- file.path("..", "datasets", "temp_data_to_be_deleted.csv")
    write_csv(air_accidents, temp_data_path)

    The variable names and values are saved. No type information is saved. When the file is input again, there may be a need for some type conversion. Type conversion is covered in the next chapter.

    Note, this will overwrite the file at temp_data_path, if one already existed.

4.3.4 Examples - Python

These examples use the airAccs.csv data set.

  1. We begin by loading the pandas and numpy packages and Path from pathlib, importing the csv file, and renaming the variables.

    from pathlib import Path
    import pandas as pd
    import numpy as np
    airAccs_path = Path('..') / 'datasets' / 'airAccs.csv'
    air_accidents_in = pd.read_csv(airAccs_path)
    air_accidents_in = (
        air_accidents_in
            .rename(
                columns={
                    air_accidents_in.columns[0]: 'obs_num',
                    'Date': 'date', 
                    'planeType': 'plane_type',
                    'Dead': 'dead',
                    'Aboard': 'aboard',
                    'Ground': 'ground'}))
    
    print(air_accidents_in.dtypes)
    obs_num         int64
    date           object
    location       object
    operator       object
    plane_type     object
    dead          float64
    aboard        float64
    ground        float64
    dtype: object

    Notice that the data frame was imported with _in appended to the data frame name used in the prior section.

  2. Make a copy of the data frame.

    We will make a copy of the data and name it air_accidents, removing the _in from the name it was input as. This coding practice preserves the original data set and does not modify it. The original data set is then available for comparison as a reference while cleaning the data. This practice may not be possible if the data set is very large.

    The copy() method is used to create an independent copy of the data frame. (Recall that the assignment operator creates another reference to the same physical data.) The deep=True parameter requests a complete copy of the data. This is the default for copy(), and it is written out here to make the intent explicit. With deep=False, copy() will create what is called a shallow copy and some elements and attributes of the new data frame may be shared with the copied-from data frame.

    air_accidents = air_accidents_in.copy(deep=True)

  3. The following code demonstrates that without the use of the copy() method, the two object names reference the same object.

    air_accidents_reference = air_accidents_in
    air_accidents_reference.rename(
        columns={'date': 'changed_date_name'},
        inplace=True)
    
    print(air_accidents_in.dtypes)
    obs_num                int64
    changed_date_name     object
    location              object
    operator              object
    plane_type            object
    dead                 float64
    aboard               float64
    ground               float64
    dtype: object

    The change made to air_accidents_reference is seen in air_accidents_in.

  4. The data set (with the new names) is output to a csv file. This is done for demonstration purposes. There is no need to save this mostly unmodified data set. This step of writing a data frame to a file would typically only be done if you have changes that require either a lot of time or a lot of code to run.

    temp_data_path = Path('..') / 'datasets' / 'temp_data_to_be_deleted.csv'
    # index=False keeps the pandas row index out of the file
    air_accidents.to_csv(temp_data_path, index=False)

    The variable names and values are saved. No type information is saved. When the file is input again, there may be a need for some type conversion. Type conversion is covered in the next chapter.

    Note, this will overwrite the file at temp_data_path, if one already existed.
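The statement in step 4 that no type information is saved can be checked with a small round trip. This is a sketch under stated assumptions: the two-row data frame is made up for illustration, and an in-memory buffer stands in for the csv file so that nothing is written to disk.

```python
import io

import pandas as pd

# A tiny stand-in data frame with a true date column (illustrative values only).
small = pd.DataFrame({
    "date": pd.to_datetime(["1908-09-17", "1912-07-12"]),
    "dead": [1.0, 5.0],
})

# Write to an in-memory text buffer instead of a file.
buffer = io.StringIO()
small.to_csv(buffer, index=False)

# Read the csv text back with default settings.
buffer.seek(0)
as_read = pd.read_csv(buffer)

# Read it again, asking pandas to convert the date column.
buffer.seek(0)
with_dates = pd.read_csv(buffer, parse_dates=["date"])

date_type_kept = pd.api.types.is_datetime64_any_dtype(as_read["date"])
date_type_restored = pd.api.types.is_datetime64_any_dtype(with_dates["date"])
```

With default settings the date column does not come back as a date type, so date_type_kept is False. Supplying parse_dates is one way to restore it when the file is read.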

4.3.5 Exercises

These exercises use the PSID.csv data set that was imported in the prior section.

  1. Import the PSID.csv data set. Set the variable names to something useful, if they are not already. Change at least one name.

  2. Create a copy of the imported data frame that will be used for data cleaning.

  3. Save the data frame as a csv to a file. Make sure to give the file a new name.