Joins
Relationships and subsequent reunions of data produce meaningful information, and Spark provides an operation to achieve the Join transformation.
The Join transformation
To join information on different datasets, Spark provides the Join transformation, similar to the join operation on relational databases.
And like on those databases, there are different types of joins that can be performed between separate yet related datasets.
Let’s introduce some Venn diagrams along with our code examples.
Types of Joins
In this course, we’re going to focus on the following Joins: Inner, Self, Left Outer, Right Outer, and Full Outer.
Join is a wide transformation, so the likely possibility of data shuffling has to be taken into account when using it.
The general structure of Join
The following code snippet, that belongs to the joins project located in the widget down this lesson, presents the general structure of a join transformation’s syntax at the code level.
public Dataset<Row> join(final Dataset<?> right, final Column joinExprs, final String joinType)
The method’s signature (part of the Spark DataFrame API) is almost self-explanatory and expects the following arguments:
-
A right dataset (or generically typed DataFrame). This is the second DataFrame to join with.
-
The join expression. This is a column-based type of expression that allows specifying the Join criteria.
-
The type of Join to be performed.
There are overloaded versions of this method, but it is useful to keep this one as the foundation to help understand the rest of them.
In this lesson’s project, we’re working with a trivial example of relationships between datasets. One dataset holds employee information,whereas the second (related) dateset holds departmental information.
mvn install exec:exec
Both sets of information are loaded as usual through the session.read()
method, and all their records are displayed with the customary show()
action.
We can skip the argument to print the first records, because these datasets are very small.
The employee’s DataFrame has the following information and structure:
+------+---------------+--------------+----------+-----------+------+------+
|emp_id| name|manager_emp_id|start_year|emp_dept_id|gender|salary|
+------+---------------+--------------+----------+-----------+------+------+
| 1| John Smith| -1| 2011| 10| M| 4000|
| 2| Julia Leeds| 1| 2014| 20| F| 3000|
| 3|Paul Chancellor| 1| 2013| 20| M| 2000|
| 4| Juan Perez| 2| 2013| 40| M| 3000|
| 5| Robert O'Neill| 2|
...