Union, UnionByName, and DropDuplicates

Get introduced to Union, UnionByName, and DropDuplicates transformations in this lesson.

We'll cover the following...

Union
UnionByName
DropDuplicates

Union

The union transformation allows us to combine two DataFrames, thus producing a new one containing the rows from both.

This operation has the following characteristics:

The schemas of both DataFrames have to be identical. This does not deviate much from the classic SQL UNION operation available in relational databases.
Duplicate records are preserved and included in the final result, which matches the behavior of SQL UNION ALL rather than plain UNION.
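
The two characteristics above can be sketched in Java as follows. This is a minimal illustration, not the lesson project's code: the class name UnionSketch, the helper unionOffices, the two-column schema, and the sample rows are all hypothetical, and the session is assumed to run in local mode.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class UnionSketch {

    // Hypothetical schema, used only for this illustration.
    static final StructType SCHEMA = new StructType()
            .add("name", DataTypes.StringType)
            .add("office", DataTypes.StringType);

    // Builds two small DataFrames with the same schema and unions them.
    static Dataset<Row> unionOffices(SparkSession spark) {
        List<Row> firstRows = Arrays.asList(
                RowFactory.create("Alice", "US"),
                RowFactory.create("Bob", "US"));
        List<Row> secondRows = Arrays.asList(
                RowFactory.create("Bob", "US"),       // duplicate of a row in firstRows
                RowFactory.create("Carol", "London"));

        Dataset<Row> first = spark.createDataFrame(firstRows, SCHEMA);
        Dataset<Row> second = spark.createDataFrame(secondRows, SCHEMA);

        // union() appends the rows of 'second' to those of 'first'; duplicates stay.
        return first.union(second);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("union-sketch")
                .master("local[*]")   // assumption: local execution
                .getOrCreate();

        Dataset<Row> all = unionOffices(spark);
        all.show();
        System.out.println("rows: " + all.count());

        spark.stop();
    }
}
```

Since both inputs hold two rows and nothing is deduplicated, the count is four, with the ("Bob", "US") row appearing twice.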

We first present a graphical representation of this transformation. It illustrates an interesting property that makes union an attractive transformation in specific scenarios.

The union transformation stacks one DataFrame after the other. No information is exchanged between the DataFrames or, more importantly, between their partitions, which is why partitions are omitted from the diagram. No sorting is applied before the rows are merged, either.

This fundamentally means that no shuffling occurs in this transformation, and the rows are just collected as they come into a new DataFrame.

Note: Rows are not actually collected as they are with the collect() action, which returns them to the driver program. Instead, the union operation assigns the rows to a new DataFrame. In other words, if a DataFrame is an abstraction over rows stored on the cluster nodes, the result of this transformation is a DataFrame that abstracts over the rows of both original DataFrames.
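
One way to observe the absence of shuffling is to compare partition counts before and after the union. The sketch below is an illustration under assumptions: the class name UnionPartitionsSketch and the helper partitionCounts are hypothetical, the inputs are generated with spark.range() so that each one has a known number of partitions, and the session runs in local mode.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class UnionPartitionsSketch {

    // Returns the partition counts of the two inputs and of their union.
    static int[] partitionCounts(SparkSession spark) {
        // range(start, end, step, numPartitions) fixes the partition count up front.
        Dataset<Row> first = spark.range(0, 100, 1, 2).toDF();
        Dataset<Row> second = spark.range(100, 200, 1, 3).toDF();

        // union() is a transformation: no job runs and no rows reach the driver here.
        Dataset<Row> combined = first.union(second);

        // Prints the physical plan, which can be inspected for exchange (shuffle) steps.
        combined.explain();

        return new int[] {
                first.rdd().getNumPartitions(),
                second.rdd().getNumPartitions(),
                combined.rdd().getNumPartitions()
        };
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("union-partitions-sketch")
                .master("local[*]")   // assumption: local execution
                .getOrCreate();

        int[] counts = partitionCounts(spark);
        System.out.println("first: " + counts[0]
                + ", second: " + counts[1]
                + ", union: " + counts[2]);

        spark.stop();
    }
}
```

If the partitions of the inputs are simply placed one after the other, as described above, the partition count of the union should equal the sum of the two input counts.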

The business domain for this lesson’s project contains information about salespeople working for a fictitious company. This company has two offices: a main branch in the United States and a branch in London, in the United Kingdom.
