Deep Dive: Internals of Spark Execution
This lesson expands on the previously introduced landscape of Spark's execution flow and architecture.
In previous lessons, we used diagrams to gradually show how Spark's components interact and how the execution flow of a Spark application falls into place.
This lesson builds on that by presenting a broader picture of the Spark landscape during a typical, straightforward application's execution. It also dives into two of the most common cluster topologies in which a Spark application can run.
The big picture
The diagram below shows the Spark ecosystem while it executes a program:
The component that binds the application code living in the driver program to its execution on a cluster is usually either a master node or a cluster manager. Which one it is depends on how the cluster is set up and configured. Let's expand on this a bit.
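From the application's point of view, this binding component is selected by the master URL the driver is given. Below is a minimal Scala sketch; the host name `master-host` and port `7077` are hypothetical placeholders for a standalone Master, while `yarn` would instead delegate scheduling to a YARN cluster manager:

```scala
import org.apache.spark.sql.SparkSession

// The master URL decides which component binds the driver to the cluster:
//   "spark://master-host:7077" -> a standalone Master node (host is a placeholder)
//   "yarn"                     -> the YARN cluster manager
//   "local[*]"                 -> no cluster at all; run in-process for testing
val spark = SparkSession
  .builder()
  .appName("binding-example")
  .master("spark://master-host:7077") // hypothetical standalone Master
  .getOrCreate()

println(s"Bound to: ${spark.sparkContext.master}")
spark.stop()
```

In practice the master URL is more often supplied externally (for example via `spark-submit`) so the same application code can run against different cluster setups unchanged.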
Master or cluster manager?
Installing Spark on physical machines or servers, and thus creating a Spark-capable cluster, is usually done in one of two ways.
The resulting physical setup of the cluster defines what is usually referred to as the cluster mode in which Spark applications execute. It also imposes different constraints on the functionality that the cluster exposes to Spark as an overall computing resource.
The following paragraphs explain the two most common modes.
Standalone mode
Although Standalone Mode involves considerably more manual intervention to set up and maintain, it is simpler to execute applications on than the Cluster Mode discussed below. Standalone Mode consists of two groups of components: a single Master node and a group of Worker nodes.
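Once an application is connected to a standalone cluster, the Worker nodes surface as the hosts running its executors. The sketch below, assuming an already-running SparkSession named `spark` attached to such a cluster, lists those hosts through Spark's status tracker:

```scala
// Assumes `spark` is an active SparkSession connected to a standalone cluster.
val tracker = spark.sparkContext.statusTracker

// Each executor runs on one of the Worker nodes the Master assigned to this
// application (the driver itself also appears in this list).
tracker.getExecutorInfos.foreach { exec =>
  println(s"executor at ${exec.host()}:${exec.port()}")
}
```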
The Master node is primarily a Resource Manager targeted at clusters that are expected to run only Spark applications, because it greedily claims all the available resources when executing a Spark application. ...
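Because the standalone Master hands an application every free core by default, deployments that must share the cluster typically cap each application explicitly. Here is a minimal sketch of such a cap using the standard `spark.cores.max` and `spark.executor.memory` settings; the Master address and the values themselves are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Without spark.cores.max, a standalone Master gives this application
// all cores currently free on the cluster (the "greedy" default).
val conf = new SparkConf()
  .setAppName("capped-app")
  .setMaster("spark://master-host:7077") // hypothetical Master address
  .set("spark.cores.max", "4")           // total cores across the whole application
  .set("spark.executor.memory", "2g")    // memory granted to each executor

val spark = SparkSession.builder().config(conf).getOrCreate()
```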