Cluster Monitoring and SparkUI
Get introduced to SparkUI, a useful tool shipped with the Spark library.
Monitoring in SparkUI
The SparkUI is a user interface (UI) provided by the Spark libraries that allows the developer to query and inspect both the status of jobs and resource usage.
Note: The tool is found on port 4040 by default, though this is configurable. Its address is expressed as URL:PORT, where URL is the master node or cluster manager IP address.
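As a minimal sketch of the URL:PORT form described in the note, the helper below builds a SparkUI address and checks whether anything is listening on it. The class name `SparkUiAddress` and its methods are hypothetical and are not part of Spark. The only Spark-specific facts assumed are the default port 4040 and the `spark.ui.port` property used to override it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper for illustration only; not part of the Spark API.
final class SparkUiAddress {
    // Default SparkUI port; Spark lets you override it with the spark.ui.port property.
    static final int DEFAULT_UI_PORT = 4040;

    // Builds the address in URL:PORT form, for example http://localhost:4040.
    static String uiUrl(String host, int port) {
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException("port out of range: " + port);
        }
        return "http://" + host + ":" + port;
    }

    // Convenience overload that uses the default port.
    static String uiUrl(String host) {
        return uiUrl(host, DEFAULT_UI_PORT);
    }

    // Returns true if something accepts TCP connections on host:port
    // within the timeout. This does not prove the listener is SparkUI.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : DEFAULT_UI_PORT;
        System.out.println("SparkUI address: " + uiUrl(host, port));
        System.out.println("Reachable: " + isReachable(host, port, 1000));
    }
}
```

Running it with no arguments while a Spark application is active on the same machine should report the address as reachable. Once the application ends, the same check is expected to fail, because nothing is listening on that port anymore.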
When used and understood correctly, SparkUI is a powerful tool for detecting bottlenecks in a Spark application and drawing conclusions about how resources are utilized (or underutilized).
One advantage over reading the logs is that the information is displayed in real time; that is, the different visualizations in the tool allow the developer to monitor a running Spark application as well as query the historical information of finished applications.
Nevertheless, combining log scrutiny with observation of the SparkUI screens can always provide a more comprehensive picture.
Let’s go through the different parts of the UI application by running one of the earliest projects used in this course, albeit with some code removed to make things simpler:
mvn install exec:exec
The above project initially performs three operations, which we use to show how they are tracked in SparkUI:
- It loads a CSV file from the project’s path.
- It shows the first five rows.
- And, lastly, it applies a map transformation and shows the first five transformed rows in the returning DataFrame.
With Spark running as a service in the master node to which the driver program is submitted, we can access the SparkUI service and consult finished applications and Jobs within it. However, since we are running this in Maven, as soon as ...