...


Anatomy of a Spark Application

This lesson explains the constituents of a Spark job.


Anatomy of a Spark Application

In this lesson, we’ll formally look at the various components of a Spark job. A Spark application consists of one or several jobs, but a Spark job, unlike a MapReduce job, is much broader in scope. Each job is made of a directed acyclic graph (DAG) of stages. A stage is roughly equivalent to a map or reduce phase in MapReduce. A stage is split into tasks by the Spark runtime, and the tasks execute in parallel on partitions of an RDD across the cluster. The relationship among these various concepts is depicted below:
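Beyond the diagram, one way to see this hierarchy at runtime is Spark's status tracker, which reports the jobs a set of actions produced, the stages inside each job, and the number of tasks per stage. The following is a minimal sketch, assuming a running spark-shell session; the group label "anatomy-demo" and the repartition-then-count computation are arbitrary choices made only for illustration.

// A minimal sketch (run inside spark-shell); "anatomy-demo" is an arbitrary
// label used only to look the resulting jobs up again.
spark.sparkContext.setJobGroup("anatomy-demo", "illustrate jobs, stages and tasks")

// One action produces one job; the shuffle introduced by repartition splits
// the job into multiple stages, each of which runs as parallel tasks.
spark.range(0, 1000).repartition(4).count()

val tracker = spark.sparkContext.statusTracker
for (jobId <- tracker.getJobIdsForGroup("anatomy-demo");
     jobInfo <- tracker.getJobInfo(jobId)) {
  println(s"Job $jobId -> stages ${jobInfo.stageIds().mkString(", ")}")
  for (stageId <- jobInfo.stageIds();
       stageInfo <- tracker.getStageInfo(stageId)) {
    println(s"  Stage $stageId ran ${stageInfo.numTasks()} tasks")
  }
}

spark.sparkContext.clearJobGroup()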

A single Spark application can run one or more Spark jobs serially or in parallel. RDDs cached by one job can be made available to a second job without any disk I/O in between, which makes certain computations extremely fast. A job always executes in the context of a Spark application. The spark-shell is an instance of a Spark application.
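For instance, the sketch below, assuming a spark-shell session and using made-up names, caches a DataFrame so that the second job reads it straight from memory instead of recomputing it:

// A minimal sketch: `expensive` stands in for any costly computation
// whose result we want to reuse across jobs.
val expensive = spark.range(0, 1000000).selectExpr("id * 2 as doubled")
// Mark the data for in-memory caching; the cache is filled lazily,
// by the first action that materializes it.
expensive.cache()
// First job: materializes the DataFrame and populates the cache.
expensive.count()
// Second job: reuses the cached partitions from memory, with no disk I/O
// between the two jobs.
expensive.selectExpr("sum(doubled)").collect()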

Let’s see an example to better understand jobs, stages, and tasks. Consider the example below: it creates two DataFrames, each consisting of the integers from 0 to 9, and repartitions them. Next, we transform one of the DataFrames to consist of multiples of 3 by multiplying each element by 3. Finally, we compute the intersection of the two DataFrames and sum up the result. The Scala code presented below can be executed in the terminal.

// Start the Spark shell
spark-shell
// Create two DataFrames, each consisting of ints from 0 to 9
val dataFrame1 = spark.range(10).toDF()
val dataFrame2 = spark.range(10).toDF()
// Examine the number of partitions
dataFrame1.rdd.getNumPartitions
// Repartition the data
val transformation1 = dataFrame1.repartition(20)
val transformation2 = dataFrame2.repartition(20)
// Create multiples of 3 from dataFrame1
val mulsOf3 = transformation1.selectExpr("id * 3 as id")
// Perform an inner join between the two dataframes.
// transformation2.collect() = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
// mulsOf3.collect() = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
val join = mulsOf3.join(transformation2, "id")
// Sum the intersection of the two dataFrames now
// join.collect() = [0, 3, 6, 9]
val sum = join.selectExpr("sum(id)")
// Perform an action to get the final result
val result = sum.collect()
// Examine the physical plan generated
sum.explain()

You need to draw a distinction between transformations and actions in a Spark job. All the statements before the sum.collect() call are transformations. Transformations are at the core of expressing your business logic using Spark. ...