Joins and Lookups

Learn to combine data coming from multiple tibbles in the tidyverse using different types of joins.

One of the most common data-cleaning operations that data scientists face is combining multiple data sources. Often in data science, we receive separate datasets that need to be combined. The datasets might come from multiple files, database tables, or even distinct databases altogether, but the need to combine data in some form is a common one.

When we join two datasets, we augment one or both input tables with additional columns or rows from the other. We can add columns or rows, depending on the type of join performed. This is achieved by leveraging a common key column across both input tables. When the key columns match, we augment the matching records with additional data from the other table to form a final combined table.

Types of joins

Before we get into the actual coding of joins, it’s essential to understand that we can carry out four different types of join in R. The different types of joins lead to different behavior when a match is found. Each type of join is helpful in particular scenarios, and using the wrong kind of join can have disastrous consequences! Projects can be completely derailed if the wrong type of join is used early in the project, especially when the mistake can go unnoticed for weeks.

The four types of join in R are:

  1. full_join: This join combines two input tibbles by adding both their columns and rows together. It’s called full because the final output table will contain every record from both input tables. If an id in the key column exists in Table A but not Table B, then the full_join of Table A and Table B will still contain that record. The data columns unique to Table B will be filled in as blank or N/A. Similarly, the data columns unique to Table A will be filled in as blank or
...