Transformations (II): FlatMap and Distinct

Get introduced to the second set of basic transformations.

FlatMap

The FlatMap operation is a long-standing staple of functional programming, and it can be tricky to understand conceptually. FlatMap does two things:

  1. Like any map transformation, it applies a function to each element of a collection. In this respect, it is no different from the plain map() function described before.

  2. If the input is a collection of collections (say, a List of Lists or an array of arrays), it flattens the results into a single collection.

So fundamentally, both map and flatMap transform objects with a function, but they differ in how the elements are processed. map works on a single collection and produces one output element per input element, whereas flatMap works on nested collections and merges the inner results into one collection.

In Spark, the concept is similar but differs in a few ways, so let’s start by visualizing it graphically and then practice it in a code example.

Before diving into the code snippets, let’s compare flatMap in the Java API with its counterpart in the Spark API.

In Java, flatMap flattens a collection of collections into a single collection of elements. For instance, the nested list [{1,2}, {3,4}, {5,6}] would be flattened to [1,2,3,4,5,6]. Moreover, if the map component is also applied, for example by adding 1 to each element of the nested collections, the result would be [2,3,4,5,6,7].
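The two results above can be reproduced with the Java Streams API. This is a minimal sketch rather than the course project's code: the class name FlatMapSketch and its helper methods are hypothetical, and List.of assumes Java 9 or later.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of the course project.
class FlatMapSketch {

    // Flattens a list of lists into one list, preserving element order.
    static List<Integer> flatten(List<List<Integer>> nested) {
        return nested.stream()
                .flatMap(List::stream)   // each inner list becomes a stream, then all are merged
                .collect(Collectors.toList());
    }

    // Flattens and adds `increment` to every element in the same pass.
    static List<Integer> flattenAndAdd(List<List<Integer>> nested, int increment) {
        return nested.stream()
                .flatMap(inner -> inner.stream().map(n -> n + increment))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<List<Integer>> nested =
                List.of(List.of(1, 2), List.of(3, 4), List.of(5, 6));
        System.out.println(flatten(nested));          // [1, 2, 3, 4, 5, 6]
        System.out.println(flattenAndAdd(nested, 1)); // [2, 3, 4, 5, 6, 7]
    }
}
```

Note that the function passed to flatMap returns a stream per input element. That is what separates it from map, which would leave a stream of three inner lists instead of six numbers.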

The same occurs in the corresponding Spark transformation, which also offers the option of applying the operation only to specific columns that hold collections. In the diagram, this is represented by ‘Col2’, the second column in each row of the DataFrame.

Furthermore, this operation is carried out row by row until every row of the Dataset is flattened. A good way of looking at the rows, then, is to see each one as a collection of elements.
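To make the row-by-row idea concrete, the sketch below models it with plain Java streams instead of a running Spark session. The InputRow and OutputRow records and the flattenCol2 method are hypothetical names, not Spark classes. Repeating Col1 on every output row is an assumption made for illustration, and records require Java 16 or later. In Spark's Java API, the same shape of function would be passed to Dataset.flatMap as a FlatMapFunction that returns an Iterator, together with an Encoder for the output type.

```java
import java.util.List;
import java.util.stream.Collectors;

// Plain-Java model of flattening one collection-valued column, row by row.
class RowFlattenSketch {

    // Hypothetical input row: Col1 is a plain value, Col2 holds a collection.
    record InputRow(String col1, List<Integer> col2) {}

    // Hypothetical output row: one per element of the original Col2.
    record OutputRow(String col1, int col2) {}

    // Visits each row in turn and emits one output row per Col2 element.
    // Assumption: Col1 is copied onto every row produced from its input row.
    static List<OutputRow> flattenCol2(List<InputRow> rows) {
        return rows.stream()
                .flatMap(row -> row.col2().stream()
                        .map(value -> new OutputRow(row.col1(), value)))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<InputRow> rows = List.of(
                new InputRow("a", List.of(1, 2)),
                new InputRow("b", List.of(3)));
        // Two input rows become three output rows: (a,1), (a,2), (b,3).
        flattenCol2(rows).forEach(System.out::println);
    }
}
```

Only Col2 is expanded here. A row whose Col2 is empty would contribute no output rows at all, which is worth keeping in mind when counting results.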

Let’s take a closer look at the situation explained in the previous subsection to clear up any doubts we might have.

Note: Because some transformations (e.g., Distinct) might produce a bit of extra output while working on the partitions of a DataFrame, this project includes breakpoints in the code. Please pay attention to the output screen while running the application, as it requires user input to display the operations step by step.

mvn install exec:exec
Project for FlatMap and Distinct transformations

The scenario

A common scenario while working in batch applications might ...