Architecture

Get insights into the architecture of Spark.

Spark design

Spark is a distributed parallel data-processing framework that bears many similarities to the traditional MapReduce framework. Spark has the same leader-worker architecture as MapReduce: the leader process coordinates and distributes work among the worker processes. These two kinds of processes are formally called the driver and the executor.

Driver

The driver is the leader process that manages the execution of a Spark job. It is responsible for maintaining the overall state of the Spark application, responding to a user's program or input, and analyzing, distributing, and scheduling work among the executor processes. The driver is, in essence, the heart of the Spark application and maintains all application-related information during the application's lifetime.

The Spark driver converts Spark operations into DAG computations and schedules and distributes them as tasks across the Spark executors. The driver accesses the distributed components in the cluster, including the executors and the cluster manager, via the SparkSession. You can consider the SparkSession to be a single point of entry and access to all Spark operations and data. Through the SparkSession we can read from data sources, write DataFrames or Datasets, set runtime JVM parameters, and so on. In essence, the SparkSession is the unified conduit to all of Spark's functionality. If we are using the interactive spark-shell, the Spark driver instantiates the SparkSession for us, whereas in a standalone Spark application, we create the SparkSession ourselves. We'll look at examples of both in the lessons ahead.
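As a minimal sketch of the standalone case, the snippet below builds a SparkSession and uses it to read and write data. The application name and file paths are placeholder values, not part of any real project:

```scala
import org.apache.spark.sql.SparkSession

object SparkSessionSketch {
  def main(args: Array[String]): Unit = {
    // Build (or reuse) the single SparkSession for this application.
    // "local[*]" runs Spark locally on all available cores; on a real
    // cluster the master URL would point at the cluster manager instead.
    val spark = SparkSession.builder()
      .appName("ArchitectureDemo") // hypothetical application name
      .master("local[*]")
      .getOrCreate()

    // Read from a data source into a DataFrame...
    val df = spark.read.json("path/to/people.json") // hypothetical path

    // ...and write the DataFrame back out.
    df.write.mode("overwrite").parquet("path/to/output") // hypothetical path

    spark.stop()
  }
}
```

In the interactive spark-shell, by contrast, an equivalent session is already available as the predefined variable spark.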

Executor

Executors are the worker processes that execute the code assigned to them by the driver and report the state of that computation back to the driver.
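To illustrate this division of labor, here is a small sketch reusing the SparkSession from the previous snippet: the function passed to map runs inside the executor processes on partitions of the data, while collect returns the results to the driver.

```scala
// Reusing the SparkSession `spark` from the earlier sketch.
val rdd = spark.sparkContext.parallelize(1 to 1000, numSlices = 8)

// This function is serialized, shipped to the executors, and applied
// to each partition's elements there, not on the driver.
val squared = rdd.map(x => x * x)

// collect() gathers the executors' results back into the driver process.
val results: Array[Int] = squared.collect()
```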

Once resources have been allocated, the driver communicates directly with the executors. In most deployment modes, only a single executor runs per node. Spark executors are assigned tasks that require working on a subset of the data located closest to them in the cluster. Working on data in close proximity is referred to as data locality, and it helps reduce the consumption of network bandwidth.
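Data locality also surfaces in Spark's configuration. As a sketch, the spark.locality.wait setting below controls how long the scheduler waits for a data-local slot before launching a task at a less-local level; the "3s" shown is Spark's default, and the application name is a placeholder:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: tune how long the scheduler waits to place a task on a node
// that holds the task's data before falling back to a less-local slot.
// "3s" is Spark's default; raising it favors locality over launch latency.
val spark = SparkSession.builder()
  .appName("LocalityDemo") // hypothetical application name
  .config("spark.locality.wait", "3s")
  .getOrCreate()
```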
