This lesson focuses on how to perform operations on an RDD's workers in parallel to transform it into another RDD, and how to extract information from these distributed datasets. Spark provides parallel operations for exactly this purpose: users don't have to extract or transform data from each worker separately, because Spark applies each function simultaneously across all of an RDD's workers, producing a new RDD. There are two types of operations we can perform on RDDs: transformations and actions.
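As an illustrative sketch of this idea, the snippet below distributes a small local collection across workers, applies one parallel operation, and pulls the result back to the driver. The local master URL and the sample numbers are assumptions made for the example, not part of the lesson's dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ParallelOpsSketch {
  def main(args: Array[String]): Unit = {
    // A local SparkContext for illustration; a real deployment would point at a cluster master.
    val sc = new SparkContext(new SparkConf().setAppName("parallel-ops").setMaster("local[*]"))

    // Distribute a local collection across the workers as an RDD.
    val numbers = sc.parallelize(1 to 10)

    // A parallel operation: map() runs on every partition at once and yields a new RDD.
    val squares = numbers.map(n => n * n)

    // collect() gathers the results back to the driver program.
    println(squares.collect().mkString(", "))

    sc.stop()
  }
}
```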

Transformations

These are the operations applied to an RDD to produce a new RDD. Transformations are lazy operations, i.e., they execute only when an action is called. Instead of modifying the data immediately, Spark waits until an action is called and builds an execution plan so that all the transformations run efficiently, possibly pipelining many of them together. Since RDDs are immutable, the input RDD remains unchanged. Spark supports many transformations, such as map(), flatMap(), mapValues(), filter(), groupByKey(), reduceByKey(), union(), join(), cogroup(), crossProduct(), sample(), partitionBy(), and sort(). Applying a transformation to an RDD to produce a new RDD, and then transforming that RDD in turn, forms a transformation chain or pipeline. Spark tracks this lineage of transformations with a graph-based representation of RDDs called a lineage graph.
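The sketch below illustrates this laziness and chaining with a simple word count; the sample sentences are invented for the example. Each transformation only records a step in the lineage, which toDebugString prints, and nothing actually runs until the collect() action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyChainSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("lazy-chain").setMaster("local[*]"))

    val lines = sc.parallelize(Seq("spark makes rdds", "rdds are immutable", "spark is lazy"))

    // Each step returns a new RDD; nothing executes yet -- Spark only records the lineage.
    val words  = lines.flatMap(_.split(" "))   // transformation
    val pairs  = words.map(word => (word, 1))  // transformation
    val counts = pairs.reduceByKey(_ + _)      // transformation

    // toDebugString prints the lineage graph Spark has recorded so far.
    println(counts.toDebugString)

    // Only this action forces the whole pipelined chain to run.
    counts.collect().foreach { case (word, n) => println(s"$word -> $n") }

    sc.stop()
  }
}
```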

The lineage graph shown below traces a series of transformations on MMA fight data. First, the UFC fights are filtered from the data; then each fight's winner is mapped to the integer 1; finally, the 1s for each fighter are reduced by key to produce that fighter's total wins.
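A minimal sketch of that pipeline follows, assuming a hypothetical (promotion, winner) record layout; the promotions and fighter names are made up for illustration and are not the lesson's actual dataset.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object FightWinsSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("fight-wins").setMaster("local[*]"))

    // Hypothetical records: (promotion, winner).
    val fights = sc.parallelize(Seq(
      ("UFC", "Nunes"), ("Bellator", "Fedor"), ("UFC", "Adesanya"),
      ("UFC", "Nunes"), ("ONE", "Lee"), ("UFC", "Adesanya")
    ))

    val totalWins = fights
      .filter { case (promotion, _) => promotion == "UFC" } // keep only UFC fights
      .map { case (_, winner) => (winner, 1) }              // pair each winner with a 1
      .reduceByKey(_ + _)                                   // sum the wins per fighter

    totalWins.collect().foreach { case (fighter, wins) => println(s"$fighter: $wins wins") }

    sc.stop()
  }
}
```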
