Union, UnionByName, and DropDuplicates
Get introduced to Union, UnionByName, and DropDuplicates transformations in this lesson.
Union
The union transformation allows us to combine two DataFrames, thus producing a new one containing the rows from both.
This operation has the following characteristics:
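- The schemas of both DataFrames have to match. Spark resolves the columns by position rather than by name, so both sides need the same columns in the same order. This doesn’t deviate much from the classical SQL UNION operation available in RDBMSs.
- Duplicate records are preserved and included in the final result, as with SQL’s UNION ALL.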
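The two characteristics above can be sketched as follows. This is a minimal illustration rather than the lesson's project code: the class name UnionSketch, the two-column schema, and the sample rows are all invented here, and a local Spark session is assumed.

```java
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UnionSketch {

    // Hypothetical schema, invented for this illustration.
    static final StructType SCHEMA = new StructType()
            .add("name", DataTypes.StringType)
            .add("office", DataTypes.StringType);

    // Returns {row count after union, row count after dropDuplicates}.
    static long[] counts() {
        SparkSession spark = SparkSession.builder()
                .appName("union-sketch")
                .master("local[*]") // assumption: run locally
                .getOrCreate();
        try {
            Dataset<Row> first = spark.createDataFrame(Arrays.asList(
                    RowFactory.create("Alice", "US"),
                    RowFactory.create("Bob", "US")), SCHEMA);
            Dataset<Row> second = spark.createDataFrame(Arrays.asList(
                    RowFactory.create("Bob", "US"),   // duplicate of a row in `first`
                    RowFactory.create("Carol", "UK")), SCHEMA);

            // Both inputs share SCHEMA, so the positional match is trivial.
            Dataset<Row> all = first.union(second);
            all.show();

            return new long[] { all.count(), all.dropDuplicates().count() };
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        long[] c = counts();
        System.out.println("rows after union:          " + c[0]); // 4, the duplicate is kept
        System.out.println("rows after dropDuplicates: " + c[1]); // 3
    }
}
```

The duplicated row survives the union and only disappears once dropDuplicates() is applied explicitly.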
We'll start with a graphical representation of this transformation, which illustrates an interesting property that makes union an attractive transformation in specific scenarios.
The union transformation merges the two DataFrames by piling one on top of the other. No exchange of information happens between them or, more importantly, between their partitions, which is why partitions are omitted from the diagram. Also, no sorting is applied before merging the rows.
This fundamentally means that no shuffling occurs in this transformation, and the rows are just collected as they come into a new DataFrame.
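One way to observe the absence of shuffling is to compare partition counts before and after the union. The sketch below is only an illustration under stated assumptions: the class name UnionPartitionsSketch and the partition counts 2 and 3 are arbitrary choices made here, and the inputs are generated with spark.range() instead of the lesson's data.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnionPartitionsSketch {

    // Returns the partition counts of {left, right, left.union(right)}.
    static int[] partitionCounts() {
        SparkSession spark = SparkSession.builder()
                .appName("union-partitions-sketch")
                .master("local[*]") // assumption: run locally
                .getOrCreate();
        try {
            // range(start, end, step, numPartitions)
            Dataset<Row> left = spark.range(0, 100, 1, 2).toDF();
            Dataset<Row> right = spark.range(100, 200, 1, 3).toDF();

            Dataset<Row> both = left.union(right);

            // Prints the physical plan; a shuffle would appear as an Exchange step.
            both.explain();

            return new int[] {
                left.javaRDD().getNumPartitions(),
                right.javaRDD().getNumPartitions(),
                both.javaRDD().getNumPartitions()
            };
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        int[] p = partitionCounts();
        System.out.println("left=" + p[0] + ", right=" + p[1] + ", union=" + p[2]);
    }
}
```

Because the partitions of both inputs are carried over untouched, the union should report as many partitions as the two inputs combined.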
Note: Rows are not actually collected as with the collect() action, which returns them to the driver program. Instead, the union operation assigns the rows to a new DataFrame. In other words, if a DataFrame is an abstraction over rows stored in the cluster nodes, the result of this transformation is a new DataFrame abstracting over the rows of both original DataFrames.
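The difference between the transformation and the collect() action can be sketched like this. The class name UnionVsCollectSketch and the generated input ranges are assumptions made purely for illustration.

```java
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnionVsCollectSketch {

    // Returns how many rows reach the driver once an action is invoked.
    static int collectedSize() {
        SparkSession spark = SparkSession.builder()
                .appName("union-vs-collect-sketch")
                .master("local[*]") // assumption: run locally
                .getOrCreate();
        try {
            Dataset<Row> first = spark.range(0, 3).toDF();  // ids 0, 1, 2
            Dataset<Row> second = spark.range(3, 5).toDF(); // ids 3, 4

            // Transformation: only describes the combined DataFrame.
            // No rows are moved to the driver at this point.
            Dataset<Row> both = first.union(second);

            // Action: this is the step that brings the rows to the driver.
            List<Row> rows = both.collectAsList();
            return rows.size();
        } finally {
            spark.stop();
        }
    }

    public static void main(String[] args) {
        System.out.println("rows on the driver: " + collectedSize()); // 5
    }
}
```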
The business domain for this lesson’s project contains information about salespeople working for a fictitious company. This company, in turn, owns a couple of offices: a main branch in the United States and a branch in London, in the United Kingdom.
mvn install exec:exec
Note: Due to this transformation outputting a considerable amount of logs (just like in real-life scenarios), we’ve included the following breakpoint messages that require input to ...