Deploying and Running a Spark Application
Learn how to build, deploy, and run a Spark application on a cluster.
Cluster’s components and interactions
The previous lesson introduced a diagram of the most relevant logical and physical components of a Spark cluster in cluster mode and gave a view of the interactions between them.
However, to expand on how these components interact when a Spark application runs in a cluster, it helps to look at a more dynamic picture, presented in the image below (similar to the previous diagram), which highlights the main parts and interactions:
First, it’s important to clarify that the entities in the diagram ultimately represent logical components.
Each of these components runs in a separate JVM process, and they may also run on different physical machines, depending on the deployment mode, which in turn affects how the Spark application is configured and executed.
Application deployment modes
A Spark application can be deployed to run in the cluster following different strategies; let’s briefly review them.
Client deployment mode
Client deployment mode allows an application to be submitted from a machine in the cluster acting as a gateway or edge node; that is, the application is submitted from a node in the cluster, and the driver process runs on that node for the duration of the execution.
There are two ways to do this, which are linked to the previous lesson’s cluster mode description:
For instance, if standalone mode is the cluster mode of choice, then a master process (Spark’s own) serves as the actual cluster manager, and in this case the driver program runs on the same node as that process, that is, on the master node.
However, if the chosen cluster mode is based on a third-party cluster manager (for example, YARN or Mesos), the application can be deployed from the same node where that manager resides and acts as the master node, but it can also be deployed through interfaces the cluster manager provides, such as an API or built-in commands.
In either case, running the driver process within the cluster ensures network visibility and communication with both the executor processes on all the worker nodes and the master process in charge of resource allocation.
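For illustration, here is a minimal sketch of what a client-mode submission could look like in both cases; the class name, jar path, and host name are hypothetical placeholders and not part of the lesson's cluster setup:

```bash
# Submission against a Spark standalone master (hypothetical host and port);
# the driver runs on the node where spark-submit is executed.
spark-submit \
  --class com.example.MyApp \
  --master spark://master-node:7077 \
  /path/to/my-app.jar

# Submission against a YARN-managed cluster, run from an edge node that has
# the YARN client configuration available; the driver again runs locally.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  /path/to/my-app.jar
```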
Note: This is the default deploy mode of the `spark-submit` command, but it can be changed using different configurations passed to the command.
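For example, the deploy mode can be set explicitly on the command line; the `--deploy-mode` flag and the `spark.submit.deployMode` property are standard `spark-submit` options, while the application details below remain hypothetical:

```bash
# Explicitly request client deploy mode (the default) with the --deploy-mode flag.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode client \
  /path/to/my-app.jar

# The same setting can also be passed as a configuration property.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --conf spark.submit.deployMode=client \
  /path/to/my-app.jar
```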
Some points to highlight about this deployment mode:
- If the `spark-submit` command is used and is terminated by the user, the application is also terminated.
- It is usually used for testing purposes and is not intended for production environments.
- If a gateway