`merge`

Basics

merge is a function that can be used to combine data.frames by row or column names. The effects of the function replicate the join operations in SQL.

Video Example 1 — Merging data.frames with `fars`

Video Explanation

dat <- read.csv("/class/datamine/data/fars/7581.csv")
state_names <- read.csv("/class/datamine/data/fars/states.csv")

We look at the heads of both data frames.

head(dat)
head(state_names)

The STATE column of dat corresponds to the code column of state_names. We can merge the two on these columns and call the result mynewDF.

mynewDF <- merge(dat, state_names, by.x="STATE", by.y="code")

The new column state (STATE contains the code, state contains the name) is appended to the merged data.frame and stored as mynewDF. We can now calculate fields of interest using enhanced data from merge. How about the average count of drunk drivers by state?

sort(tapply(mynewDF$DRUNK_DR, mynewDF$state, mean))

Video Example 2 — Merging data.frames with `flights`

Video explanation

myDF <- read.csv("/class/datamine/data/flights/subset/1995.csv")

Here we use flight data from 1995. Origin and Dest are the flight takeoff and landing locations, but we only have the airport codes in terms of information on those airports. Airport information is stored in another data.frame:

airportsDF <- read.csv("/class/datamine/data/flights/subset/airports.csv")

We see that the alphanumeric codes in Origin and Dest are contained in iata within the airports data.frame.

To display the information about the Origin airport, we can use merge and link Origin in myDF with iata in airportsDF:

mynewDF <- merge(myDF, airportsDF, by.x="Origin", by.y="iata")

The resulting data frame has the same size as myDF and now has extra columns with information about the Origin airport. Once again, we can use tapply on our enhanced data.frame; how about summing the distances of all flights departing from each state?

sort(tapply( mynewDF$Distance, mynewDF$state, sum ))

`books` and `author` Example

For the following exercises, we use the data.frames books and authors:

books

   id                                     title author_id rating
1   1     Harry Potter and the Sorcerers Stone          1   4.47
2   2   Harry Potter and the Chamber of Secrets         1   4.43
3   3  Harry Potter and the Prisoner of Azkaban         1   4.57
4   4       Harry Potter and the Goblet of Fire         1   4.56
5   5 Harry Potter and the Order of the Phoenix         1   4.50
6   6    Harry Potter and the Half Blood Prince         1   4.57
7   7      Harry Potter and the Deathly Hallows         1   4.62
8   8                          The Way of Kings         2   4.64
9   9                            The Book Thief         3   4.37
10 10                      The Eye of the World         4   4.18

authors

   id                     name avg_rating
1   1             J.K. Rowling       4.46
2   2        Brandon Sanderson       4.39
3   3             Markus Zusak       4.34
4   4            Robert Jordan       4.18
5   5          Agatha Christie       4.00
6   6                Alex Kava       4.02
7   7    Nassim Nicholas Taleb       3.99
8   8              Neil Gaiman       4.13
9   9            Stieg Larsson       4.16
10 10 Antoine de Saint-Exupéry       4.30

If I merge the data.frames on `author_id` in `books` and `id` in `authors`, how do I ensure only perfect matches (no `NA` entries) are returned?

Click to see solution

# In SQL this is referred to as an INNER JOIN.
merge(books, authors, by.x="author_id", by.y="id", all=F)

   author_id id                                     title rating
1          1  1     Harry Potter and the Sorcerers Stone    4.47
2          1  2   Harry Potter and the Chamber of Secrets   4.43
3          1  3  Harry Potter and the Prisoner of Azkaban   4.57
4          1  4       Harry Potter and the Goblet of Fire   4.56
5          1  5 Harry Potter and the Order of the Phoenix   4.50
6          1  6    Harry Potter and the Half Blood Prince   4.57
7          1  7      Harry Potter and the Deathly Hallows   4.62
8          2  8                          The Way of Kings   4.64
9          3  9                            The Book Thief   4.37
10         4 10                      The Eye of the World   4.18
                name avg_rating
1       J.K. Rowling       4.46
2       J.K. Rowling       4.46
3       J.K. Rowling       4.46
4       J.K. Rowling       4.46
5       J.K. Rowling       4.46
6       J.K. Rowling       4.46
7       J.K. Rowling       4.46
8  Brandon Sanderson       4.39
9       Markus Zusak       4.34
10     Robert Jordan       4.18

If I merge the data.frames on `author_id` in `books` and `id` in `authors`, how do I keep author information even if they have no books in `books`?

Click to see solution

# since `authors` is the second data.frame listed, this is a RIGHT JOIN in SQL.
# If `authors` were listed first, it would be a LEFT JOIN and we would only need to swap `x` and `y`.
merge(books, authors, by.x="author_id", by.y="id", all.y=T)

   author_id id                                     title rating
1          1  1     Harry Potter and the Sorcerers Stone    4.47
2          1  2   Harry Potter and the Chamber of Secrets   4.43
3          1  3  Harry Potter and the Prisoner of Azkaban   4.57
4          1  4       Harry Potter and the Goblet of Fire   4.56
5          1  5 Harry Potter and the Order of the Phoenix   4.50
6          1  6    Harry Potter and the Half Blood Prince   4.57
7          1  7      Harry Potter and the Deathly Hallows   4.62
8          2  8                          The Way of Kings   4.64
9          3  9                            The Book Thief   4.37
10         4 10                      The Eye of the World   4.18
11         5 NA                                      <NA>     NA
12         6 NA                                      <NA>     NA
13         7 NA                                      <NA>     NA
14         8 NA                                      <NA>     NA
15         9 NA                                      <NA>     NA
16        10 NA                                      <NA>     NA
                       name avg_rating
1              J.K. Rowling       4.46
2              J.K. Rowling       4.46
3              J.K. Rowling       4.46
4              J.K. Rowling       4.46
5              J.K. Rowling       4.46
6              J.K. Rowling       4.46
7              J.K. Rowling       4.46
8         Brandon Sanderson       4.39
9              Markus Zusak       4.34
10            Robert Jordan       4.18
11          Agatha Christie       4.00
12                Alex Kava       4.02
13    Nassim Nicholas Taleb       3.99
14              Neil Gaiman       4.13
15            Stieg Larsson       4.16
16 Antoine de Saint-Exupéry       4.30

merge

Basics

Video Example 1 — Merging data.frames with fars

Video Example 2 — Merging data.frames with flights

books and author Example

If I merge the data.frames on author_id in books and id in authors, how do I ensure only perfect matches (no NA entries) are returned?

If I merge the data.frames on author_id in books and id in authors, how do I keep author information even if they have no books in books?