Anatomy of a Spark Application
Read up on the constituents of a Spark application.
Building blocks of a Spark application
In this lesson, we'll formally look at the various components of a Spark application. A Spark application consists of one or more jobs. Unlike a MapReduce job, however, a Spark job is much broader in scope.
Each job is made up of a directed acyclic graph (DAG) of stages. A stage is roughly equivalent to a map or reduce phase in MapReduce. The Spark runtime splits each stage into tasks that execute in parallel on the partitions of an RDD spread across the cluster. A task is the single unit of work or execution sent to a Spark executor; each task maps to a single core and works on a single partition of data. The relationship among these concepts is depicted below:
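Alongside the figure, a quick hands-on way to see these boundaries is to force a shuffle and trigger an action. The snippet below is a minimal sketch meant for a running spark-shell; the resulting job, its stages, and their tasks can then be inspected on the Jobs tab of the Spark UI (http://localhost:4040 by default).

// A shuffle (an Exchange node in the physical plan) marks a stage boundary.
val df = spark.range(100).toDF("id")
val shuffled = df.repartition(8)

// explain() prints the physical plan; look for the Exchange operator,
// which is where Spark cuts the job into separate stages at runtime.
shuffled.explain()

// An action submits an actual job; its stages and per-stage tasks
// show up in the Spark UI.
shuffled.count()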
A single Spark application can run one or more Spark jobs, either serially or in parallel. Cached RDDs output by one job can be made available to a second job without any disk I/O in between, which makes certain computations extremely fast. A job always executes in the context of a Spark application. The spark-shell itself is an instance of a Spark application.
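The following minimal sketch (again for the spark-shell) shows caching carrying results between jobs: the first action materializes the cached DataFrame, and the second action, a separate job, reads it back from memory instead of recomputing it.

// Cache a DataFrame so later jobs can reuse it without recomputation
val cached = spark.range(1000000).toDF("id").cache()

cached.count()   // first action: runs a job and populates the cache
cached.count()   // second action: a new job that reads from the cache

// The Storage tab of the Spark UI lists the cached partitions.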
Let's look at an example to better understand jobs, stages, and tasks. The example below creates two DataFrames, each consisting of the integers from 0 to 9. Next, we transform one of the DataFrames into multiples of three by multiplying each element by 3. Finally, we compute the intersection of the two DataFrames and sum up the result. The Scala code below can be executed in the terminal.
// Start the Spark shell
spark-shell

// Create two DataFrames, each consisting of the ints from 0 to 9
val dataFrame1 = spark.range(10).toDF()
val dataFrame2 = spark.range(10).toDF()

// Examine the number of partitions
dataFrame1.rdd.getNumPartitions

// Repartition the data
val transformation1 = dataFrame1.repartition(20)
val transformation2 = dataFrame2.repartition(20)

// Create multiples of 3 from dataFrame1
val mulsOf3 = transformation1.selectExpr("id * 3 as id")

// Perform an inner join between the two DataFrames.
// transformation2.collect() = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
// mulsOf3.collect()         = [0, 3, 6, 9, 12, 15, 18, 21, 24, 27]
val join = mulsOf3.join(transformation2, "id")

// Sum the intersection of the two DataFrames
// join.collect() = [0, 3, 6, 9]
val sum = join.selectExpr("sum(id)")

// Perform an action to get the final result (18)
val result = sum.collect()

// Examine the physical plan generated
sum.explain()
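One detail worth noting about the plan: the join shuffles data, and the number of tasks in the post-shuffle stage is governed by the spark.sql.shuffle.partitions setting (200 by default; with adaptive query execution enabled, Spark may coalesce these at runtime) rather than by the 20 partitions we created with repartition. It can be inspected or changed from the shell:

// Number of partitions Spark uses when shuffling for joins/aggregations
spark.conf.get("spark.sql.shuffle.partitions")

// Lower it so the post-join stage runs fewer tasks on this tiny dataset
spark.conf.set("spark.sql.shuffle.partitions", "20")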
Transformations and actions
You need to draw a distinction between transformations ...