Deploying and Running a Spark Application
Learn how to build, deploy, and run a Spark application on a cluster.
Cluster’s components and interactions
The previous lesson introduced a diagram of the most relevant logical and physical components of a Spark cluster in cluster mode and gave a view of the interactions between them.
However, to expand on how these components interact when a Spark application runs in a cluster, it helps to look at a more dynamic picture, presented in the image below (similar to the previous diagram), which highlights the main parts and interactions:
First, it’s important to clarify that the entities in the diagram ultimately represent logical components.
Each of these components runs in a separate JVM process, and they may also run on different physical machines, depending on the deployment mode, which in turn affects how the Spark application is configured and executed.
Application deployment modes
A Spark application can be deployed to run in the cluster following different strategies; let’s briefly review them.
Client deployment mode
Client deployment mode allows an application to be submitted from a machine in the cluster acting as a gateway or edge node; that is, the application is submitted from a node in the cluster, and the driver process runs on that node for the duration of the execution.
There are two ways to do this, which are linked to the previous lesson’s cluster mode description:
For instance, if standalone mode is the cluster mode of choice, then a master process (Spark’s own) serves as the actual cluster manager, and in this case the driver program runs on the same node as that process, that is, on the master node.
However, if the chosen cluster mode is based on a third-party cluster manager (for example, YARN or Mesos), the application can be deployed from the same node where that manager resides and acts as the master node, but it can also be deployed through interfaces the cluster manager provides, such as an API or built-in commands.
In either case, running the driver process within the cluster ensures network visibility and communication with both the executor processes on all the worker nodes and the master process in charge of resource allocation.
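For illustration, here is a minimal sketch of what a client-mode submission could look like in both cases; the class name, jar path, and host name are hypothetical placeholders and not part of the lesson's cluster setup:

```bash
# Submission against a Spark standalone master (hypothetical host and port);
# the driver runs on the node where spark-submit is executed.
spark-submit \
  --class com.example.MyApp \
  --master spark://master-node:7077 \
  /path/to/my-app.jar

# Submission against a YARN-managed cluster, run from an edge node that has
# the YARN client configuration available; the driver again runs locally.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  /path/to/my-app.jar
```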
Note: This is the default deploy mode of the `spark-submit` command, but it can be changed using different configurations passed to the command.
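For example, the deploy mode can be set explicitly on the command line; the `--deploy-mode` flag and the `spark.submit.deployMode` property are standard `spark-submit` options, while the application details below remain hypothetical:

```bash
# Explicitly request client deploy mode (the default) with the --deploy-mode flag.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode client \
  /path/to/my-app.jar

# The same setting can also be passed as a configuration property.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --conf spark.submit.deployMode=client \
  /path/to/my-app.jar
```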
Some points to highlight about this deployment mode:
- If the `spark-submit` command is used and is terminated by the user, the application is also terminated.
- It is usually used for testing purposes and is not intended for production environments.
- If a gateway