Transformations and Actions

Learn the two cornerstone concepts of transformations and actions in Spark.

Two types of operations

After having worked on the previous example projects, we’re better positioned to understand two crucial concepts, and their related operations, involved in any Spark application: transformations and actions.

These two concepts are exposed programmatically through the methods of Spark API abstractions such as DataFrames, RDDs, and JavaRDDs.

Transformations

Transformations are operations that can change both the structure of a DataFrame and its contents. We've applied both kinds of changes in previous examples while:

  1. Renaming, dropping, and creating columns of a DataFrame or a Dataset (with methods such as withColumn() and drop()).

  2. Doing calculations on each row of the DataFrame, whether to add a new column or to create a Dataset of POJOs (when we introduced the map() method and the related MapFunction interface).

Every time we applied a transformation, we also got a new DataFrame as a result. This is due to a fundamental property of the abstraction:

  • DataFrames are immutable structures. In practical terms, they are objects that can be created and read but not updated. To obtain a modified version of a DataFrame, a transformation is applied and a new DataFrame is created from the existing one's information.
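The immutability idea can be illustrated with a minimal plain-Java sketch. It does not use the Spark API: an unmodifiable List stands in for a DataFrame, and the class name ImmutabilitySketch and the helper doubled() are hypothetical names chosen only for this illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

class ImmutabilitySketch {
    // Hypothetical helper (not Spark API): builds a NEW list with every value
    // doubled. The input list is only read, never modified.
    static List<Integer> doubled(List<Integer> rows) {
        return rows.stream()
                .map(value -> value * 2)
                .collect(Collectors.toUnmodifiableList());
    }

    public static void main(String[] args) {
        List<Integer> original = List.of(1, 2, 3);     // stand-in for an existing DataFrame
        List<Integer> transformed = doubled(original); // the "transformation" yields a new object

        System.out.println(original);    // prints [1, 2, 3]
        System.out.println(transformed); // prints [2, 4, 6]
    }
}
```

The original list is unchanged after the call, and attempting to update either list in place (for example, with add()) throws an UnsupportedOperationException. Spark DataFrames behave analogously in that transformations return a new DataFrame rather than modifying the existing one.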

Note: Object immutability is not a Spark invention. It's an important and long-established concept in both functional and object-oriented programming.

Transformations are also the means of applying business logic to a dataset, and they are usually classified into two different categories.

Narrow transformations

A transformation is classified as a narrow transformation if each input partition it operates on contributes to only a single corresponding output partition.

In other words, the operation can be thought of as a 1-to-1 relationship between input partitions and their corresponding output partitions.
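To make the 1-to-1 relationship concrete, here is a rough plain-Java sketch. It is not Spark code: partitions are modeled as a list of lists, and mapPartitionWise() is a hypothetical helper that applies a per-row function to each partition independently, so that output partition i is built from input partition i alone.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class NarrowTransformationSketch {
    // Hypothetical helper (not Spark API): applies fn to every row, one
    // partition at a time. No rows move between partitions, so the number
    // of partitions and their order are preserved.
    static <T, R> List<List<R>> mapPartitionWise(List<List<T>> partitions, Function<T, R> fn) {
        List<List<R>> output = new ArrayList<>();
        for (List<T> partition : partitions) {
            List<R> mapped = new ArrayList<>();
            for (T row : partition) {
                mapped.add(fn.apply(row)); // exactly one output row per input row
            }
            output.add(mapped);
        }
        return output;
    }

    public static void main(String[] args) {
        // Three input partitions holding five rows in total.
        List<List<Integer>> input = List.of(List.of(1, 2), List.of(3, 4), List.of(5));
        List<List<Integer>> result = mapPartitionWise(input, row -> row * 10);

        System.out.println(result); // prints [[10, 20], [30, 40], [50]]
    }
}
```

Because each output partition depends on exactly one input partition, every partition in this model could be processed in isolation, which is the property that defines a narrow transformation.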

A good example is the map transformation, in which each row is mapped, or transformed, into exactly one row, as the diagram below shows:
