17 Merging

Merging is the process of combining multiple datasets into a single dataset. Examples include adding inflation factors to panel data to adjust income to today’s rates, or adding county-level statistics to individual-level data.

Load dplyr and the airquality dataset.

library(dplyr)

air <- airquality

The actual act of merging two datasets usually involves only one or two lines of code. Most of the work when merging happens before this, in preparing dataframes for merging and in deciding which rows we want in the output.

17.1 Preparing Dataframes for Merging

The air dataframe has a numeric vector of month numbers stored in the Month column.

air$Month
  [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6
 [45] 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7
 [89] 7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9
[133] 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9

As an exercise in merging, we can create a small dataframe with month names and then merge that with the air dataframe.

month.name is a character vector of month names included with R.

month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"      "July"     
 [8] "August"    "September" "October"   "November"  "December" 

How can we add this information to the air dataframe? We need a key, an identifier that we can use to match one set of information to another. A common example of a key in the social sciences is a person ID, which can be used to link information across multiple datasets.

We first need to turn this character vector into a dataframe:

myMonths <- as.data.frame(month.name)

Now, we can add a key column. A good choice would be the month number, since this is already in the air dataframe, and it is quite easy to add to myMonths since the months are already sorted chronologically. Create a new column with the name Month. Having the same name for the same data in two dataframes makes merging easier.

myMonths <- 
  myMonths %>% 
  mutate(Month = 1:12)

Sometimes you will have data that is in one dataframe but not in another. Maybe an individual dropped out of a study, or their data was lost, or they could not be contacted for a follow-up survey. Whatever the reason, this is a common problem, so for practice, drop June from myMonths with filter(). (See the chapter on Subsetting for more about filter().)

In other words, the case just described is when a row in one dataset has no matches in another. Another case is when a row in one dataset has multiple matches in the other. To illustrate this, we can add a duplicate of May called “May2” with add_row().

myMonths <-
  myMonths %>% 
  filter(!month.name %in% "June") %>% 
  add_row(month.name = "May2", Month = 5)

Finally, rename the month.name column. It can be confusing if month.name and myMonths$month.name have the same name but are different vectors with different lengths.

myMonths <-
  myMonths %>% 
  rename(Month_Name = month.name)

17.2 Merging Two Dataframes

Now that we have a dataframe with month names, we can merge it with our air dataframe.

Base R and the tidyverse have their own functions for merging. Both work well in pipes and provide you options for which rows to keep from the first (“x” or “left”) and second (“y” or “right”) dataframes.

Below are four pairs of equivalent functions, where the first in each pair uses merge() from base R and the second uses one of the _join() functions from dplyr. Note that the default settings of merge() make it equivalent to inner_join().

We do not always need to explicitly name our key column (i.e., we can choose to not type by = "Month") and the functions will figure out the key column on their own. However, it is good practice to do so, to maintain knowledge and control of your code.

Remember, air only has months May through September, while myMonths has everything but June and it has “May” and “May2”. Each dataframe has at least one month that the other does not. Try out each of the functions below and note how they handle:

  • June, which is in air but not in myMonths
  • January-April and October-December, which are in myMonths but not in air
  • May and May2, both of which are recorded as the fifth month in myMonths
# rows in x AND y
merge(air, myMonths, by = "Month")
inner_join(air, myMonths, by = "Month")

# rows in x OR y
merge(air, myMonths, by = "Month", all = T)
full_join(air, myMonths, by = "Month")

# rows in x
merge(air, myMonths, by = "Month", all.x = T)
left_join(air, myMonths, by = "Month")

# rows in y
merge(air, myMonths, by = "Month", all.y = T)
right_join(air, myMonths, by = "Month")

dplyr has three other joining functions not offered by merge().

Two of these are semi_join() and anti_join(), which merge and subset in one step, do not include columns that exist only in the second dataframe. semi_join() returns rows from the first dataframe with a match in the second, and anti_join() returns rows from the first that do not have a match in the second.

These two functions are useful when we work with multiple datasets and we want to quickly limit one by the other. We might have one dataset with individual-level survey data, and another with county-level data, but we do not have county-level data for all individuals. We can quickly find individuals with data in both datasets with semi_join(). Or, we might have longitudinal data stored in multiple datasets, one for each time point. If we want to analyze individuals who dropped out between time points, we can easily find them with anti_join().

Of course, we could also do these in two steps, first using one of the other _join() functions and then using filter() to find rows that do or do not have values from the second datasets. (For two steps, first use a _join() function and then filter(!is.na(var_name)) or filter(is.na(var_name)). See chapter on Subsetting for using filter().)

Using our datasets, the Month_Name column is not in the output. If any rows in the first dataframe have multiple matches in the second (“5” matches to both “May” and “May2”), only one row is returned in the output, unlike the other _join() functions above where we have one copy with “May” and another copy with “May2”.

air has months 5-9 and myMonths has months 1-5 and 7-12. Which months will be in the output of each line below? Make predictions, and then run each line on your own computer.

semi_join(air, myMonths, by = "Month")
semi_join(myMonths, air, by = "Month")

anti_join(air, myMonths, by = "Month")
anti_join(myMonths, air, by = "Month")

The final merging function offered by dplyr is nest_join(), which returns a dataframe with an additional list-column of dataframes. Each value in this new column is a dataframe of all matching rows from the second dataframe.

test <- nest_join(air, myMonths, by = "Month")

test[[7]][[1]]  # May - two matches
# A tibble: 2 x 1
  Month_Name
  <chr>     
1 May       
2 May2      
test[[7]][[32]] # June - no matches
# A tibble: 0 x 1
# ... with 1 variable: Month_Name <chr>
test[[7]][[62]] # July - one match
# A tibble: 1 x 1
  Month_Name
  <chr>     
1 July      

The column of dataframes can be unpacked, or “unnested”, with the unnest() function from the tidyr package.

The cols argument needs to be given the name of our list-column, which is the name of the second dataframe in our original nest_join() call.

If keep_empty = F (the default), unnest() will drop rows without matches in the second dataframe, so we will lose June. This makes the output the same as we got with inner_join() earlier.

If keep_empty = T, the rows without matches will be retained, and we will end up with the dataframe we got with full_join().

library(tidyr)

unnest(test, cols = myMonths)

unnest(test, cols = myMonths, keep_empty = T)

17.2.1 Merging Two Dataframes with Different Key Column Names

Sometimes your key variable will have different names in your dataframes. One option would be to rename one to match the other, as we did above when we created a Month column in myMonths. Another choice is to use the by arguments of each function, as shown below. Change the name of the key variable in myMonths from Month to month_number.

myMonths <- 
  myMonths %>% 
  rename(month_number = Month)

merge(air, myMonths, by.x = "Month", by.y = "month_number")
inner_join(air, myMonths, by = c("Month" = "month_number"))

17.3 Appending Rows

Merging combines the columns of two dataframes. To append (add) rows from one or more dataframes to another, use the bind_rows() function from dplyr.

This function is especially useful in combining survey responses from different individuals. bind_rows() will match columns by name, so the dataframes can have different numbers and names of columns and rows.

# fake survey responses
survey001 <- as.data.frame(t(c(1, 1, "Sometimes", 6, 7)))
survey002 <- as.data.frame(t(c(4, 3, 1, 1)))
survey003 <- as.data.frame(t(c(6, 1, 2, 2, 4, "Never")))

# question names, in different orders
colnames(survey001) <- c("Q1", "Q2", "Q2a", "Q3", "Q4")
colnames(survey002) <- c("Q3", "Q4", "Q1", "Q2")
colnames(survey003) <- c("Q3", "Q2", "Q2a", "Q1", "Q4", "Q4a")

bind_rows(survey001, survey002, survey003)
Error: Can't combine `..1$Q1` <character> and `..2$Q1` <double>.

Uh-oh. When combining columns from different dataframes, you generally will not see errors like this, since columns can contain different types. However, each column can contain only one type of data, so rows must be the same data type.

You will commonly get errors when adding rows from one dataframe to another. If you have a manageable number of dataframes, as we do here, you can manually convert the columns before combining. Here, if we run str() for the three dataframes, we see that all columns are characters in survey001 and survey003, while all columns in survey002 are numeric.

str(survey001)
'data.frame':   1 obs. of  5 variables:
 $ Q1 : chr "1"
 $ Q2 : chr "1"
 $ Q2a: chr "Sometimes"
 $ Q3 : chr "6"
 $ Q4 : chr "7"
str(survey002)
'data.frame':   1 obs. of  4 variables:
 $ Q3: num 4
 $ Q4: num 3
 $ Q1: num 1
 $ Q2: num 1
str(survey003)
'data.frame':   1 obs. of  6 variables:
 $ Q3 : chr "6"
 $ Q2 : chr "1"
 $ Q2a: chr "2"
 $ Q1 : chr "2"
 $ Q4 : chr "4"
 $ Q4a: chr "Never"

Coerce all columns in survey002 to character, and then try bind_rows() again.

survey002 <- apply(survey002, 2, as.character)

surveys <- bind_rows(survey001, survey002, survey003)

surveys
  Q1 Q2       Q2a Q3 Q4   Q4a
1  1  1 Sometimes  6  7  <NA>
2  1  1      <NA>  4  3  <NA>
3  2  1         2  6  4 Never

Even though the columns of the individual dataframes were in different orders, they were re-ordered to match one another. Wherever a column did not exist in one of the dataframes (see Q2a and Q4a), the column was populated with NA.

While coercing the columns to characters did allow us to bind the rows of all our dataframes, we probably want some columns to be different types for a later analysis. Right now, we cannot calculate mean(surveys$Q1) since it is a character vector.

mean(surveys$Q1)
Warning in mean.default(surveys$Q1): argument is not numeric or logical: returning NA
[1] NA

We could go through each column and convert the type (surveys$Q1 <- as.numeric(surveys$Q1), etc.), or we could make use of type_convert() from the readr package. This function guesses the “true” type of each column and converts it for us, letting us know which type it chose for each column.

library(readr)

surveys <- type_convert(surveys)

-- Column specification ----------------------------------------------------------------------
cols(
  Q1 = col_double(),
  Q2 = col_double(),
  Q2a = col_character(),
  Q3 = col_double(),
  Q4 = col_double(),
  Q4a = col_character()
)
mean(surveys$Q1)
[1] 1.333333

17.4 Merging Exercises

  1. Merge the beaver1 and beaver2 datasets on the time column, and include all rows.
  • Bonus: Add a column called warmer that has a 1 whenever beaver1’s temperature was higher, and a 2 whenever beaver2’s temperature was higher. Ignore missing values. (If beaver1 is missing and beaver2 is not, the new column should say 2.) If the two beavers’ temperatures are equal, make the value of warmer missing.
  1. Append the rows of beaver2 to beaver1. Make sure there is a column that specifies the beaver number (1 or 2) for each observation.

  2. Combine state.abb, state.division, and state.name into a single dataframe.

  • Drop all rows where the state name is more than one word.
  • Merge this dataframe with state.x77.
  • Bonus: Add two columns corresponding to the two objects in the list state.center.