R for Researchers: Data preparation

This article is part of the R for Researchers series. For a list of topics covered by this series, see the Introduction article. If you're new to R we highly recommend reading the articles in order.

Overview

This article will introduce you to commands which are useful in preparing data for analysis. This includes commands to import your data, check for missing data and other errors, create new variables, and change the types of variables. The result of data preparation is a dataset which is in the form needed for your analysis.

The code to import a data set typically follows the session set up code. We recommend that all the code to clean up a data set be kept together following the code which imports the raw data set. It is common to discover additional data preparation needs throughout the analysis of a data set. When these additional data preparation needs are discovered, add the code to the data preparation section of your script rather than the exploration or analysis section you are working on. This will make your code easier to work with.

Understanding the data in a data set as well as the analysis which is to be done is necessary to prepare a data set for analysis. In this article, the understanding of the data set has been provided. The data preparation work included in this article was chosen to demonstrate the general approach and introduce some common functions. It is not meant to be a complete data preparation for this data set.

Preliminaries

You will get the most from this article if you follow along with the examples in RStudio. Working the exercises will further enhance your skills with the material. The following steps will prepare your RStudio session to run this article's examples.

  • Start RStudio and open your RFR project.
  • Confirm that RFR (the name of your project) is displayed in the upper left corner of the RStudio window.
  • Open the SalAnalysis script. You started this script in the R Scripts article.
  • Run all the commands in the script.

The instructions for adding the datasets folder to your RFR project are in the R Scripts article.

Importing data

There are a number of R functions which import data. This article series focus on the most common approach, which is to import datasets which are organized with variables as columns in the data set. The R function we will use is read.table(). The file types we will import data from are text (.txt file type) or comma separated values (.csv file type).

R allows multiple datasets to be loaded simultaneously. Each dataset must be given a name when it is loaded to distinguish it from other datasets.

  • Syntax and parameters for the read.table() function

    name <- read.table("filePath",parameters).

    Returns a data.frame. A data.frame is a two dimensional structure organized in rows and columns. A data.frame is the equivalent of Stata's dataset.

    Name is the name you are giving the dataset in R.

    FilePath is the path and name of the data file on your computer. If the file is in the work directory, then only the file name, with the file extension, is needed. If the file is in another folder, the path to the folder needs to be provided with the file name.

    The ", parameters" is an optional list of parameters.

    Some commonly used parameters are

    header set to FALSE indicates no column names are present. TRUE indicates column names are in the first row. Header has no default value. If header is not present in the call, the presence of a header row is determined from the first row of the data file. If the first row has one less value than the rest of the data, the first row is assumed to be column names.

    sep is a character string and defaults to "". "" indicates that any white space is used as a separator.

    col.names is a set of character strings used as the column names. col.names has no default. The more common coding practice is to provide column names after the file has been imported.

    Other parameters can be explored using the help() function.

If you are not sure if your data has a header row or not, open the file in a text editor and check for column names.

The read.csv() function is a "wrapper" to the read.table() function with the parameter defaults set for .csv files. Sometimes it is convenient to use read.csv(), but read.table() can always be used.

If the data you are provided is in a document with other text, copying and pasting the portion of the file with the data to a .txt file usually results in a file which R can read.

The Salaries dataset will be used for the examples in this article.

  • Enter the following commands into your SalAnalysis script.

    #####################################################
    #####################################################
    ##
    ##   Import Data
    ##
    #####################################################
    #####################################################
    
    SalariesIn <- read.table("Datasets/Salaries.csv", 
                             sep=",",header=TRUE
                             )
    SalariesIn <- read.table("Datasets/Salaries.txt" ) 
  • There are no console results from read.table().

  • The Salaries data set was imported twice. The first import was from a .csv file and the second was from a .txt file. The second import overwrote the SalariesIn object in R.

  • No errors were produced from either call to read.table. So the dataset was successfully imported on both calls to read.table.

  • The import from the .txt file did not use the header option. R correctly determined the file contained headers.

Exercise

Do the following exercise in the AlfAnalysis script.

  1. Import the dataset in alfalfa.txt from your RFR project's Dataset folder.

Solutions

Object types and structures

Anything in R which has a name, i.e. a means to reference it, is an object. Variables, functions, data structures, etc. are all objects. The term "object" will be used in these articles when referring to variables as well as structures.

To be able to prepare data you need to know a little bit about R's object types and structures.

Object types

Data objects have a type and can be organized in structures. A type is the form of what is being stored. Structures provide the relations between what is stored. This can be thought of as: a type defines which atom is used and the structure defines what is built from a set of atoms.

R is loosely typed. This means that R will coerce a variable to the type needed, if it can. What R does for a coerced variable is create a new variable with the needed type and passes this new variable to the function or expression. The original variable's type is not changed. So object types in R are a little less restrictive than in some other languages.

The variable types we will be using in these articles are listed below.

  • Numeric which is a collection of types. R internally uses Integer, double, num, etc. for numeric variables. You will typically not need to be concerned about integer versus double, etc., since R will coerce where needed.
  • Character variables are a string of characters.
  • Logical variables take the value of either TRUE or FALSE, abbreviated as T and F. Numeric values coerced to logical assign FALSE for 0 and TRUE for all other values.
  • Factor variables are stored as integer values. Depending where it is used, a factor may be used as either a numeric, character, or a set of indicator variables.

R has functions which allow you to test what type an object is and also to coerce it to a specified type. The is.type() functions are used to test the type of an object, where type is one of the types above. Similarly as.type() is used to coerce objects.

The following example shows the test to determine if a variable x is numeric and coerces x to numeric.

  • Enter the following commands at the console.

    x <- "15"
    is.numeric(x)
    as.numeric(x)
  • The results of the above commands are shown below.

    > x <- "15"
    > is.numeric(x)
    [1] FALSE
    > as.numeric(x)
    [1] 15
  • False is returned from is.numeric() since x's type is character.

  • 15 is returned from as.numeric() since the character string "15" can be coerced to numeric.

Object structures

Structures are made of elements. An element has a specific type (numeric, character, etc.) and holds a single value. In the structures we will be using, the elements are organized in row and columns.

  • A vector is a one dimensional structure of elements. The elements of the vector must be of the same type. A vector is a single column and the elements of the vector are the rows of the column. Vectors can be of any length.
  • A data.frame is a two dimensional structure, even if there is only a single column in the data.frame. The columns of the data.frame are vectors. The vectors need to be the same length. The vectors do not need to be of the same type and often are not. It is common for a data set to be organized such that the vectors are the variables and each row is an observation.

Other common structures are lists, matrices, and arrays. The use of these structures is similar to vectors and data.frames and can be read about in the help files as needed.

The str() function displays information about the type and structure of an object.

  • Syntax and use of the str() function.

    str(object)

    str() does not return an object. It will display information about the type and structure of object in the console.

    object is any object.

Lets look at the structure of the SalariesIn dataset.

  • Enter the following command in your script and run it.

    str(SalariesIn)
  • The results of the above commands are shown below.

    > str(SalariesIn)
    'data.frame':   397 obs. of  6 variables:
     $ rank         : Factor w/ 3 levels "AssocProf","AsstProf",..: 3 3 2 3 3 1 3 3 3 3 ...
     $ discipline   : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
     $ yrs.since.phd: int  19 20 4 45 40 6 30 45 21 18 ...
     $ yrs.service  : int  18 16 3 39 41 6 23 45 20 18 ...
     $ sex          : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
     $ salary       : int  139750 173200 79750 115000 141500 97000 175000 147765 119250 129000 ...
  • You can see that SalariesIn is of type data.frame. The name and type of each variables is also displayed.

Vectors can be constructed using the c() function. Scalar values as well as vectors can be appended together to create a vector.

  • Syntax and use of the c() function.

    c(object1, object2, . . .)

    Returns a vector of a type which all the objects can be coerced to.

  • Enter the following commands at the console.

    aa <- c(1,7,4,9)
    aa
    bb <- c("alpha","beta")
    bb
    c(1,bb,4)
    c("first",aa,"last")
  • The results of the above commands are shown below.

    > aa <- c(1,7,4,9)
    > aa
    [1] 1 7 4 9
    > bb <- c("alpha","beta")
    > bb
    [1] "alpha" "beta" 
    > c(1,bb,4)
    [1] "1"     "alpha" "beta"  "4"    
    > c("first",aa,"last") 
    [1] "first" "1"     "7"     "4"     "9"     "last" 
  • Both c(1,bb,4) and c("first",aa,"last") result in vectors of type character. Type character is what all the elements in each lists could be coerced to.

Subsetting structures

R allows you to use all or just some of the elements of a structure together. There are several methods for referencing some of elements of a structure. One method to subset a structure is to use [row,column] indexing. Another method, when a column is named, is to use object$ColumnName indexing. Let's look at a few examples.

  • Subsetting using row and column numbers.

    Run the following commands in the console.

    SalariesIn[4,3]           # row 4, column 3
    SalariesIn[2,]            # second row, observation 2
    SalariesIn[2:6,]          # observations 2 through 6
    SalariesIn[c(7,9,16),]    # observations 7, 9, and 16
  • The results are shown below.

    > SalariesIn[4,3]           # row 4, column 3
    [1] 45
    > SalariesIn[2,]            # second row, observation 2
      rank discipline yrs.since.phd yrs.service  sex salary
    2 Prof          B            20          16 Male 173200
    > SalariesIn[2:6,]          # observations 2 through 6
           rank discipline yrs.since.phd yrs.service  sex salary
    2      Prof          B            20          16 Male 173200
    3  AsstProf          B             4           3 Male  79750
    4      Prof          B            45          39 Male 115000
    5      Prof          B            40          41 Male 141500
    6 AssocProf          B             6           6 Male  97000
    > SalariesIn[c(7,9,16),]    # observations 7, 9, and 16
       rank discipline yrs.since.phd yrs.service  sex salary
    7  Prof          B            30          23 Male 175000
    9  Prof          B            21          20 Male 119250
    16 Prof          B            12           3 Male 117150
  • SalariesIn[4,3] selects the 4th observation of the 3rd variable. This is a single value, a scalar.

  • SalariesIn[2,] selects the second observation, all columns because no columns were specified. This is a data.frame of one row.

  • The last two examples above, SalariesIn[2:6,] and SalariesIn[c(7,9,16),], use a vector of integers to specify a range of rows. This can also be done to select a range of columns.

  • Subsetting using column names.

    Run the following commands in the console.

    SalariesIn$yrs.service        # yrs.service variable
    SalariesIn[5,"yrs.service"]   # row 5 of the yrs.service variable
    SalariesIn$discipline[6]      # 6th observation of the 
                                  # discipline variable

    The results are

    > SalariesIn$yrs.service        # yrs.service variable
      [1] 18 16  3 39 41  6 23 45 20 18  8  2  1  0 18  3 20 34 23 36 26 31 30
     [24] 19  8  8 23  3  0  8  4  2  9  2  2  0 21  4 31  9  2 23 27 38 19 15
     [47] 28 19 25  1 28 11  3  9 11  5 21  8  9  3  8  2 31 11  3  8 12 31 17
     [70] 36  2 45 19 34 23  3  3 19  1  2 28 16 20  2 18 14 37  2 25  7  5  7
     [93]  7 38 20  0 12  7 14 26 25 23  5 14 10 28  8  8  8 31 16 16  1 37  0
    [116]  9 29 36  1  3 14 32 22 22 22 49 26  0 30  2  9 57  8  1 25 18 14 14
    [139]  7 18  8 10 11  3 27 28  4 27 26  3 12  4  9 10  0 21 18  0  6 16  2
    [162] 19  7  3  0  8 16 19  6 18  5 19 24 20  6 25  7  9 14  3 11  5  8 22
    [185] 23 30 10 10 28 19  9 22 18 19 53  7  4  4 33 22  4 40 17 17  5  2 33
    [208] 18  2 20  3 39  7 19  1 11 11 22  7 11 21 10  6 20 35 20  1  7 11 38
    [231] 27 24 19 19  3 17 25  6 40  6  3 30 37 23 23 11 23 18 23  7 39  8 12
    [254]  2  7  8 22 23  3 30 33 45 26 31 35 30 43 10 44  7 40 18  1  4  3  6
    [277] 48 27 18 46 38 27 51 43  6 49 27  0 27  5  7 28  9  1  7 36 18 11 43
    [300] 39 36 16 13  4 44 31  4 28  0 15  7  9 19 35  6  3  9 45 16 15 23  9
    [323] 11 15 31  4 15 37 10 23 60  9 10 19  6 38 23 12 25 15 11 17 38 31 35
    [346] 10 27 33  3 28 49 38 27 20  1 21 40 35 14  4 11 15 30 17 43 40 10  1
    [369] 30 31  8 20  7 26 19 26  1  3 38  8  3 23  5 44 21  9 27 15 36 18 19
    [392] 19 30 19 25 15  4
    > SalariesIn[5,"yrs.service"]   # row 5 of the yrs.service variable
    [1] 41
    > SalariesIn$discipline[6]      # 6th observation of the 
    [1] B
    Levels: A B
    >                               # discipline variable
    • SalariesIn$yrs.service selects the yrs.service vector.
    • The last two examples above, SalariesIn[5,"yrs.service"] and SalariesIn$discipline[6], select a single element from the data structure. The first of these uses the column name with the square bracket reference to rows and columns. The second uses the fact that the columnName reference returns a vector and then selects an observation from this vector. This is an example of nesting functions. The $ was used to return a column vector and [] was then used to select a row.

In R both square brackets, [], and parentheses, (), can immediately follow an object name. The square brackets are only used for subsetting data and the parentheses only for function parameters. This allows you to identify an object as either data or a function by the type of bracket that follow it. Functions always have (), even if there are no parameters. An object with square brackets or no brackets or parentheses following it is a data object.

Missing data

Checking for missing data is the start of our data preparation code. We demarcate the start of this section with a comment block.

  • Enter the following commands in your script and run them.

    #####################################################
    #####################################################
    ##
    ##   Data preparation section
    ##
    #####################################################
    #####################################################

It is good practice to check for missing data in a data set before using the data. There are a several ways R can identify a missing element, if it knows the element is missing. If the type of the data element is character, missing can be identified as either NA or "". If the type of the data element is numeric, NA is used. Numeric data may be stored as type character, so it is important to check the element type or check for both NA and "".

It is also fairly common for data sets to have their own unique identifier for a missing value of a variable. As an example of this, survey data will sometimes use -9 for an unanswered question. To fully check for missing data in a data set you need to understand what data you have been given and what the values of the elements mean.

We will use a combination of two functions to check the SalariesIn data set for values missing to R.

  • Syntax and use of the is.na() function.

    is.na(object).

    Returns an object of the same structure as object and the same dimensions as object. Each element of the returned object will be either TRUE or FALSE. There are similar functions of the form is.type() and as.type(). These functions test for or coerce to the the type given it the function name.

  • syntax and use of the sum() function

    sum(object)

    Returns a single numeric value which is the sum of the values in the object.

  • Nesting these two functions provides

    sum(is.na(object))

    Returns the number of NA's in the object. Since TRUE is coerced to 1, the sum of the number of TRUEs is the number of NA's in the object.

  • To check for blanks in an object, use a logical test for "".

    object == ""

We will check the SalariesIn data set for missing data.

  • Enter the following commands in your script and run them.

    sum( is.na(SalariesIn) )
    sum( SalariesIn == "" )
  • The following results are displayed in the console

    [1] 0
    [1] 0

There were no NAs or blanks in the data, since both sums were 0.

It is important to note the use of the functions above. The is.na() function is called from within the sum() function. Much of the power of R comes from the ability to combine simpler functions together like this to perform useful tasks. This practice will be demonstrated further through these articles.

Changing types and creating new variables

When a dataset is not too large (where too large is determined by your computer's memory), it is useful to keep the original dataset unchanged and prepare the dataset in another data.frame in the R environment. We will do this by creating a new dataset to which we will make our changes.

  • Enter the following command in your script and run it.

    salary <- SalariesIn

Changing variables

A variable is changed by assigning a new value to it. Since a variable is typically a vector of values the assignment is a new vector and not a single value. If the variable is in a data.frame, the reference to the variable will need to identify both the data.frame and the variable.

The first thing we will do to the salary data.frame is change the units for salary. Salaries are typically measured in thousands of dollars. We were given dollars.

  • Enter the following commands in your script and run them.

    salary$salary <- salary$salary / 1000
  • There is nothing displayed in the console. The assignment operator produces no displayable results.

  • The name of the data.frame, salary, and the variable, salary, are the same. This is not an issue for R since it know what each of the names are by its position in the command.

Subsets of a vectors value may be changed by assigning new values to the subset of interest.

Creating new variables

A new variable is created whenever something is assigned to a name which has not been used before. If the new name is used as a reference to a column in an object, the variable is added as a column of that object. If the variable is not associated with an object, the variable is stored in the work space.

We are going to create our new variables in the work space. After we have all the variables we need, the variables will be added to the salary data.frame. Adding vectors to a data.frame will be covered below.

We may want the log of salary, since the distribution of salaries are often skewed.

  • Enter the following command in your script and run it.

    logSalary <- log(salary$salary)

Change variable types

R's ability to coerce types reduces the need to do much type changing of your data. The areas where you will likely need to do some work are grouping and factor variables.

There are two common approaches to creating grouping variables, ifelse() and cut().

  • Syntax and use of the ifelse() function

    ifelse(condition,trueValue,falseValue)

    Returns a vector. The vector uses the value from trueValues for each observation for which condition is true and from falseValues for all others.

    Condition is a vector of type logical. This is TRUE or FALSE for each observation in the variable.

    TrueValue and falseValue are vectors. If a scalar value is provided, it will be repeated for each observation.

  • Syntax and use of the cut() function

    cut(vector,breakValues,labels=labels)

    Returns a character variable.

    Vector is the variable for which each of its elements are to be assigned to groups.

    The breakValues are used to determine which group each element in vector is assign to. An element is assigned to a group if its value falls between two of the breakValues. R has a special value inf which is the largest value a variable can take. It is often convenient to use inf as bounds in cut().

    The labels are the names of the groups formed by breakValues.

It may also be useful to consider groups of salary levels for the professors. We will create groups for high (over 135,000), low (below 91,000), and middle salaries. We will demonstrate both of the approaches given above to create the new variable.

  • Enter the following commands in your script and run them.

    salaryLevel <- ifelse(salary$salary>134, "high",
                  ifelse(salary$salary<91, "low","middle"
                         ) )
    salaryLevel2 <- cut(salary$salary,c(0,91,134,Inf),
                       labels=c("low","middle","high")
                       )
  • There are no console results for these commands.

Lets check to see if the two variables are the same.

  • Enter the following command in your script and run it.

    sum( salaryLevel != salaryLevel2 )
  • The results displayed in the console are

    [1] 2
  • The line of code provides the count of the number of levels which are not the same. Here we have found 2 observations which have been assigned to different groups.

Knowing that there are two differences in the way we generated the salary groups does not tells us why. We will use the which() function to determine which observations are different.

  • The syntax and use of the which() function.

    which(vectorLogical)

    Returns a vector of integers identifying which values in vectorLogical are TRUE.

We will use the which() function with subsetting to find which values of salary were assigned different groups.

  • Enter the following command in your script and run it.

    salary$salary[ which( salaryLevel != salaryLevel2 ) ]
  • The results displayed in the console are.

    [1] 91 91

The differences occur at the 91 cut point. To make them the same we could either change the < 91 to <= 91 in the ifelse() function or increase 91 to 91.0001 in the cut function. Since we do not need both of these variables, we will use the salaryLevel vector as it is.

We will do a str() on salaryLevel to examine the type.

  • Enter the following command in your script and run it.

    str(salaryLevel)
  • The results displayed in the console are.

     chr [1:397] "high" "high" "low" "middle" "high" "middle" ...

It is a character vector. We will want this to be of type factor.

  • The syntax and use of the factor() function is

    factor(vector)

    Returns a factor variable.

    The type of vector does not matter. Factor creates a level for each unique value in the vector.

We will use the factor function to coerce the type of salaryLevel.

  • Enter the following command in your script and run it.

    salaryFactor <- factor(salaryLevel)

Combining objects into a data.frame

The data.frame() function is used to combine objects into a data set.

  • Syntax and use of the data.frame() function.

    data.frame(object1,object2, \(\dots\))

    Returns a data.frame

    object1,object2, \(\dots\) is a list of objects which are either a vector or a structure with columns, such as an existing data.frame.

    • The names of the variables in the data.frame are taken from the vector names or the names in their current object.

    • Names for the variables in the data.frame can be provided for vector objects. This is done using nameI = vectorI, where "nameI" is the name given to variable with values taken from vectorI.

We will add the two new variables to the salary data.frame.

  • Enter the following command in your script and run it.

    salary <- data.frame(salary,
                         logSalary = logSalary,
                         salaryLevels = factor(salaryLevel)
                         )
  • This uses the current variable names from the salary data.frame and assigns logSalary and salaryLevels as the names of the two added variables.

It is also helpful to set the variable names in a data set to fairly short names. This makes working with the variables easier for the rest of the analysis.

  • Syntax and use of the colnames() function.

    colnames(object) <- vector

    • Note this is a special kind of function which can be on the left side of an assignment statement.

    • The colnames() function can also be used on the right side of an assignment statement. In this case it returns the column names of object.

    The length of vector must match the number of columns of object.

We are going to shorten the names of the variables and take a final look at the structure of the salary data.frame.

  • Enter the following commands in your script and run them.

    colnames(salary) <- c("rank","dscpl","yrSin","yrSer","sex",
                          "salary","logSal","salLev")
    str(salary)
  • The results in the console are

    'data.frame':   397 obs. of  8 variables:
     $ rank  : Factor w/ 3 levels "AssocProf","AsstProf",..: 3 3 2 3 3 1 3 3 3 3 ...
     $ dscpl : Factor w/ 2 levels "A","B": 2 2 2 2 2 2 2 2 2 2 ...
     $ yrSin : int  19 20 4 45 40 6 30 45 21 18 ...
     $ yrSer : int  18 16 3 39 41 6 23 45 20 18 ...
     $ sex   : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 1 ...
     $ salary: num  139.8 173.2 79.8 115 141.5 ...
     $ logSal: num  4.94 5.15 4.38 4.74 4.95 ...
     $ salLev: Factor w/ 3 levels "high","low","middle": 1 1 2 3 1 3 1 1 3 3 ...
  • The variables in this data set will be used in later lessons.

The cleaned up data set is typically not saved if the code to create it runs fairly quickly. If it takes a long time to create the cleaned up data set, the cleaned up data set can be saved as a data file. Then the cleaned up data set can be imported into R for subsequent R sessions. The write.table() function is used to save a data.frame to a file. The use of write.table is similar to read.table and the proper usage of it can be found in the help file. We will not save our cleaned up data set since the code to create it runs quickly. There are also internal storage formats which can be used to save datasets from the workspace. These will not be covered by this series.

We are done with our work on this script for this lesson.

  • Commit the changes you have made to this file with the commitment message "Added data prep to salAnalysis".

Naming conventions

R packages use a variety of naming conventions for variable and function names. Some of these use dots or underscores between words. Other use camel coding (lower case letters except for the first letter in words after the first word, such as numVar.) Words are abbreviated differently, some using the first letters and others using just consonants. While none of these approaches are inherently bad, some do cause problems if the code is used with other systems (periods might have other meaning, etc.) We recommend the following naming conventions for variables.

  • Use at least 2 characters. Never use T or F as variable names. This will overwrite the special definitions they have in R.
  • Use abbreviations where they are clear. For example numObs for the number of observations.
  • Be consistent in your use of abbreviation. If obs is the abbreviation you use for observations, use obs as the abbreviation in all variables where it is used.
  • Use camel code names (avoiding the use of . and _ in names.)

Commit your changes to SalAnalysis.

Exercises

Do the following exercise in the AlfAnalysis script. Use the data set alfalfa.txt you imported above for these exercises.

  1. Change the variable names to "shade","irrig", "inoc", and "yield"

  2. Create a new variable for shade level (shadeLev) from the shade variable by setting 1 to "full", 5 to "none", and the rest to "part".

  3. Change the type of shadeLev to factor.

  4. Include shadeLev in the alfalfa data.frame.

  5. Change inoculate level E to control. This is a more challenging problem.

  6. Commit your changes to AlfAnalysis.

Solutions

Next: Data exploration

Previous: R Scripts

Last Revised: 4/21/2014