Merging is the process of combining multiple datasets into a single dataset. Examples include adding inflation factors to panel data to adjust income to today’s rates, or adding county-level statistics to individual-level data.
dplyr and the
library(dplyr) air <- airquality
The actual act of merging two datasets usually involves only one or two lines of code. Most of the work when merging happens before this, in preparing dataframes for merging and in deciding which rows we want in the output.
17.1 Preparing Dataframes for Merging
air dataframe has a numeric vector of month numbers stored in the
 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 6 6 6 6 6 6 6 6 6 6 6 6  6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7  7 7 7 7 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 9 9 9 9 9 9 9 9 9  9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9
As an exercise in merging, we can create a small dataframe with month names and then merge that with the
month.name is a character vector of month names included with R.
 "January" "February" "March" "April" "May" "June" "July"  "August" "September" "October" "November" "December"
How can we add this information to the
air dataframe? We need a key, an identifier that we can use to match one set of information to another. A common example of a key in the social sciences is a person ID, which can be used to link information across multiple datasets.
We first need to turn this character vector into a dataframe:
myMonths <- as.data.frame(month.name)
Now, we can add a key column. A good choice would be the month number, since this is already in the
air dataframe, and it is quite easy to add to
myMonths since the months are already sorted chronologically. Create a new column with the name
Month. Having the same name for the same data in two dataframes makes merging easier.
myMonths <- myMonths %>% mutate(Month = 1:12)
Sometimes you will have data that is in one dataframe but not in another. Maybe an individual dropped out of a study, or their data was lost, or they could not be contacted for a follow-up survey. Whatever the reason, this is a common problem, so for practice, drop June from
filter(). (See the chapter on Subsetting for more about
In other words, the case just described is when a row in one dataset has no matches in another. Another case is when a row in one dataset has multiple matches in the other. To illustrate this, we can add a duplicate of May called “May2” with
myMonths <- myMonths %>% filter(!month.name %in% "June") %>% add_row(month.name = "May2", Month = 5)
Finally, rename the
month.name column. It can be confusing if
myMonths$month.name have the same name but are different vectors with different lengths.
myMonths <- myMonths %>% rename(Month_Name = month.name)
17.2 Merging Two Dataframes
Now that we have a dataframe with month names, we can merge it with our
Base R and the tidyverse have their own functions for merging. Both work well in pipes and provide you options for which rows to keep from the first (“x” or “left”) and second (“y” or “right”) dataframes.
Below are four pairs of equivalent functions, where the first in each pair uses
merge() from base R and the second uses one of the
_join() functions from
dplyr. Note that the default settings of
merge() make it equivalent to
We do not always need to explicitly name our key column (i.e., we can choose to not type
by = "Month") and the functions will figure out the key column on their own. However, it is good practice to do so, to maintain knowledge and control of your code.
air only has months May through September, while
myMonths has everything but June and it has “May” and “May2”. Each dataframe has at least one month that the other does not. Try out each of the functions below and note how they handle:
- June, which is in
airbut not in
- January-April and October-December, which are in
myMonthsbut not in
- May and May2, both of which are recorded as the fifth month in
# rows in x AND y merge(air, myMonths, by = "Month") inner_join(air, myMonths, by = "Month") # rows in x OR y merge(air, myMonths, by = "Month", all = T) full_join(air, myMonths, by = "Month") # rows in x merge(air, myMonths, by = "Month", all.x = T) left_join(air, myMonths, by = "Month") # rows in y merge(air, myMonths, by = "Month", all.y = T) right_join(air, myMonths, by = "Month")
dplyr has three other joining functions not offered by
Two of these are
anti_join(), which merge and subset in one step, do not include columns that exist only in the second dataframe.
semi_join() returns rows from the first dataframe with a match in the second, and
anti_join() returns rows from the first that do not have a match in the second.
These two functions are useful when we work with multiple datasets and we want to quickly limit one by the other. We might have one dataset with individual-level survey data, and another with county-level data, but we do not have county-level data for all individuals. We can quickly find individuals with data in both datasets with
semi_join(). Or, we might have longitudinal data stored in multiple datasets, one for each time point. If we want to analyze individuals who dropped out between time points, we can easily find them with
Of course, we could also do these in two steps, first using one of the other
_join() functions and then using
filter() to find rows that do or do not have values from the second datasets. (For two steps, first use a
_join() function and then
filter(is.na(var_name)). See chapter on Subsetting for using
Using our datasets, the
Month_Name column is not in the output. If any rows in the first dataframe have multiple matches in the second (“5” matches to both “May” and “May2”), only one row is returned in the output, unlike the other
_join() functions above where we have one copy with “May” and another copy with “May2”.
air has months 5-9 and
myMonths has months 1-5 and 7-12. Which months will be in the output of each line below? Make predictions, and then run each line on your own computer.
semi_join(air, myMonths, by = "Month") semi_join(myMonths, air, by = "Month") anti_join(air, myMonths, by = "Month") anti_join(myMonths, air, by = "Month")
The final merging function offered by
nest_join(), which returns a dataframe with an additional list-column of dataframes. Each value in this new column is a dataframe of all matching rows from the second dataframe.
test <- nest_join(air, myMonths, by = "Month") test[][] # May - two matches
# A tibble: 2 x 1 Month_Name <chr> 1 May 2 May2
test[][] # June - no matches
# A tibble: 0 x 1 # ... with 1 variable: Month_Name <chr>
test[][] # July - one match
# A tibble: 1 x 1 Month_Name <chr> 1 July
The column of dataframes can be unpacked, or “unnested”, with the
unnest() function from the
cols argument needs to be given the name of our list-column, which is the name of the second dataframe in our original
keep_empty = F (the default),
unnest() will drop rows without matches in the second dataframe, so we will lose June. This makes the output the same as we got with
keep_empty = T, the rows without matches will be retained, and we will end up with the dataframe we got with
library(tidyr) unnest(test, cols = myMonths) unnest(test, cols = myMonths, keep_empty = T)
17.2.1 Merging Two Dataframes with Different Key Column Names
Sometimes your key variable will have different names in your dataframes. One option would be to rename one to match the other, as we did above when we created a
Month column in
myMonths. Another choice is to use the
by arguments of each function, as shown below. Change the name of the key variable in
myMonths <- myMonths %>% rename(month_number = Month) merge(air, myMonths, by.x = "Month", by.y = "month_number") inner_join(air, myMonths, by = c("Month" = "month_number"))
17.3 Appending Rows
Merging combines the columns of two dataframes. To append (add) rows from one or more dataframes to another, use the
bind_rows() function from
This function is especially useful in combining survey responses from different individuals.
bind_rows() will match columns by name, so the dataframes can have different numbers and names of columns and rows.
# fake survey responses survey001 <- as.data.frame(t(c(1, 1, "Sometimes", 6, 7))) survey002 <- as.data.frame(t(c(4, 3, 1, 1))) survey003 <- as.data.frame(t(c(6, 1, 2, 2, 4, "Never"))) # question names, in different orders colnames(survey001) <- c("Q1", "Q2", "Q2a", "Q3", "Q4") colnames(survey002) <- c("Q3", "Q4", "Q1", "Q2") colnames(survey003) <- c("Q3", "Q2", "Q2a", "Q1", "Q4", "Q4a") bind_rows(survey001, survey002, survey003)
Error: Can't combine `..1$Q1` <character> and `..2$Q1` <double>.
Uh-oh. When combining columns from different dataframes, you generally will not see errors like this, since columns can contain different types. However, each column can contain only one type of data, so rows must be the same data type.
You will commonly get errors when adding rows from one dataframe to another. If you have a manageable number of dataframes, as we do here, you can manually convert the columns before combining. Here, if we run
str() for the three dataframes, we see that all columns are characters in
survey003, while all columns in
survey002 are numeric.
'data.frame': 1 obs. of 5 variables: $ Q1 : chr "1" $ Q2 : chr "1" $ Q2a: chr "Sometimes" $ Q3 : chr "6" $ Q4 : chr "7"
'data.frame': 1 obs. of 4 variables: $ Q3: num 4 $ Q4: num 3 $ Q1: num 1 $ Q2: num 1
'data.frame': 1 obs. of 6 variables: $ Q3 : chr "6" $ Q2 : chr "1" $ Q2a: chr "2" $ Q1 : chr "2" $ Q4 : chr "4" $ Q4a: chr "Never"
Coerce all columns in
survey002 to character, and then try
survey002 <- apply(survey002, 2, as.character) surveys <- bind_rows(survey001, survey002, survey003) surveys
Q1 Q2 Q2a Q3 Q4 Q4a 1 1 1 Sometimes 6 7 <NA> 2 1 1 <NA> 4 3 <NA> 3 2 1 2 6 4 Never
Even though the columns of the individual dataframes were in different orders, they were re-ordered to match one another. Wherever a column did not exist in one of the dataframes (see Q2a and Q4a), the column was populated with
While coercing the columns to characters did allow us to bind the rows of all our dataframes, we probably want some columns to be different types for a later analysis. Right now, we cannot calculate
mean(surveys$Q1) since it is a character vector.
Warning in mean.default(surveys$Q1): argument is not numeric or logical: returning NA
We could go through each column and convert the type (
surveys$Q1 <- as.numeric(surveys$Q1), etc.), or we could make use of
type_convert() from the
readr package. This function guesses the “true” type of each column and converts it for us, letting us know which type it chose for each column.
library(readr) surveys <- type_convert(surveys)
-- Column specification ---------------------------------------------------------------------- cols( Q1 = col_double(), Q2 = col_double(), Q2a = col_character(), Q3 = col_double(), Q4 = col_double(), Q4a = col_character() )
17.4 Merging Exercises
- Merge the
beaver2datasets on the
timecolumn, and include all rows.
- Bonus: Add a column called
warmerthat has a 1 whenever beaver1’s temperature was higher, and a 2 whenever beaver2’s temperature was higher. Ignore missing values. (If beaver1 is missing and beaver2 is not, the new column should say 2.) If the two beavers’ temperatures are equal, make the value of
Append the rows of
beaver1. Make sure there is a column that specifies the beaver number (1 or 2) for each observation.
state.nameinto a single dataframe.
- Drop all rows where the state name is more than one word.
- Merge this dataframe with
- Bonus: Add two columns corresponding to the two objects in the list