Mastering Big Data with Apache Spark and Java/

...

Joins

Relationships and subsequent reunions of data produce meaningful information, and Spark provides an operation to achieve the Join transformation.

We'll cover the following...

The Join transformation
- Types of Joins

The Join transformation

To join information on different datasets, Spark provides the Join transformation, similar to the join operation on relational databases.

And like on those databases, there are different types of joins that can be performed between separate yet related datasets.

Let’s introduce some Venn diagrams along with our code examples.

Types of Joins

In this course, we’re going to focus on the following Joins: Inner, Self, Left Outer, Right Outer, and Full Outer.

Join is a wide transformation, so the likely possibility of data shuffling has to be taken into account when using it.

The general structure of Join

The following code snippet, that belongs to the joins project located in the widget down this lesson, presents the general structure of a join transformation’s syntax at the code level.

public Dataset<Row> join(final Dataset<?> right, final Column joinExprs, final String joinType)

The method’s signature (part of the Spark DataFrame API) is almost self-explanatory and expects the following arguments:

A right dataset (or generically typed DataFrame). This is the second DataFrame to join with.
The join expression. This is a column-based type of expression that allows specifying the Join criteria.
The type of Join to be performed.

There are overloaded versions of this method, but it is useful to keep this one as the foundation to help understand the rest of them.

In this lesson’s project, we’re working with a trivial example of relationships between datasets. One dataset holds employee information,whereas the second (related) dateset holds departmental information.

Both sets of information are loaded as usual through the session.read() method, and all their records are displayed with the customary show() action.

We can skip the argument to print the first records, because these datasets are very small.

The employee’s DataFrame has the following information and structure:

+------+---------------+--------------+----------+-----------+------+------+
|emp_id|           name|manager_emp_id|start_year|emp_dept_id|gender|salary|
+------+---------------+--------------+----------+-----------+------+------+
|     1|     John Smith|            -1|      2011|         10|     M|  4000|
|     2|    Julia Leeds|             1|      2014|         20|     F|  3000|
|     3|Paul Chancellor|             1|      2013|         20|     M|  2000|
|     4|     Juan Perez|             2|      2013|         40|     M|  3000|
|     5| Robert O'Neill|             2|      2018|         20|     M|  1000|
+------+---------------+--------------+----------+-----------+------+--

...

Course Introduction

Spark Introduction and Basics

Getting Started with Spark

DataFrame Basic Operations

DataFrame Advanced Operations

Spark SQL and Other Functionalities

Building a Big Data Batch Application

Deployment and Cluster Execution

Monitoring and Performance Fundamentals

Conclusion

Apendix

Joins

The Join transformation

Types of Joins

The general structure of Join