Anatomy of a Spark Application
This lesson explains the constituents of a Spark job.
In this lesson, we’ll formally look at the various components of a Spark job. A Spark application consists of one or more jobs. But a Spark job, unlike a MapReduce job, is much broader in scope. Each job is made up of a directed acyclic graph (DAG) of stages. A stage is roughly equivalent to a map or reduce phase in MapReduce. A stage is split into tasks by the Spark runtime and executed in parallel on partitions of an RDD across the cluster. The relationship among these concepts is depicted below:
A single Spark application can run one or more Spark jobs, serially or in parallel. RDDs cached by one job can be made available to a later job without any disk I/O in between, which makes certain computations extremely fast. A job always executes in the context of a Spark application. The spark-shell itself is an instance of a Spark application.
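For instance, the following minimal sketch (the DataFrame and variable names here are made up for illustration, and it assumes you are inside spark-shell, where the spark session is predefined) shows two jobs in the same application sharing a cached dataset:

// Two jobs in one application sharing a cached dataset
val nums  = spark.range(1000000).toDF("id")
val evens = nums.filter("id % 2 = 0").cache()  // mark the filtered data to be cached
evens.count()                                  // job 1: computes the filter and populates the cache
evens.selectExpr("sum(id)").show()             // job 2: reuses the cached partitions; no recomputation or disk I/O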
Let's see an example to better understand jobs, stages, and tasks. Consider the example below; it creates two DataFrames, each consisting of the integers from 0 to 9. Next, we transform one of the DataFrames to consist of multiples of 3 by multiplying each element by 3. Finally, we compute the intersection of the two DataFrames and sum up the result. The Scala code presented below can be executed in the terminal.
// Start the Spark shell
spark-shell

// Create two DataFrames, each consisting of the integers from 0 to 9
val dataFrame1 = spark.range(10).toDF()
val dataFrame2 = spark.range(10).toDF()

// Examine the number of partitions
dataFrame1.rdd.getNumPartitions

// Repartition the data
val transformation1 = dataFrame1.repartition(20)
val transformation2 = dataFrame2.repartition(20)

// Create multiples of 3 from dataFrame1
val mulsOf3 = transformation1.selectExpr("id * 3 as id")

// Perform an inner join between the two DataFrames.
// transformation2.collect() = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
// mulsOf3.collect()         = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
val join = mulsOf3.join(transformation2, "id")

// Sum the intersection of the two DataFrames
// join.collect() = [0, 3, 6, 9]
val sum = join.selectExpr("sum(id)")

// Perform an action to get the final result
val result = sum.collect()

// Examine the physical plan generated
sum.explain()
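To see how these statements translate into jobs, stages, and tasks, you can also open the Spark web UI for the running application (when spark-shell runs locally, it is typically served on port 4040). As a small sketch, the UI's address can be printed from inside the shell:

// Print the web UI address of the current application (if the UI is enabled)
spark.sparkContext.uiWebUrl.foreach(println)  // typically http://<driver-host>:4040
// The Jobs and Stages tabs show how the collect() action above was broken
// down into jobs, stages, and parallel tasks.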
You need to draw a distinction between transformations and actions in a Spark job. All the statements before the sum.collect() call are transformations. Transformations are at the core of expressing your business logic using Spark. ...
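As a small sketch of this distinction (the DataFrame below is made up for illustration), a transformation only adds to the logical plan, while an action triggers an actual Spark job:

// Transformations are lazy: they return a new DataFrame but launch no job
val df = spark.range(100).toDF()
val doubled = df.selectExpr("id * 2 as id")  // transformation: nothing executes yet
// Actions trigger execution: count() launches a Spark job that runs the plan above
doubled.count()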