Performance Fundamentals and Recipes
Learn the fundamentals of performance for a Spark application.
Many factors and constraints affect an application’s execution performance, such as its architecture, the resources available to it, and non-functional requirements like data encryption. No single magic recipe can account for the myriad of applications and their nature when it comes to performance considerations.
Ultimately, applying a systematic approach of testing, gathering metrics and results, making changes, testing again, and repeating the process might shed some light on the bottlenecks, overheads, or simply poor design of an application. On the other hand, specific third-party libraries and frameworks, like Spark, are designed in a manner that imposes constraints on the applications that use their APIs. We have already seen an example of this with immutability, specifically when we talked about the impossibility of changing a DataFrame when a transformation is applied to it. Instead, a new DataFrame is always returned with the changes reflected.
Constraints like this need not be a foe; they can be a friend when it comes to using Spark in a performant way. With this in mind, this lesson tries to provide general guidelines and explain the fundamentals for setting the foundation of a robust Spark application, as well as describing some recipes commonly used in Spark development.
Note: Application performance optimization tends to be a very complex topic that unfortunately cannot be explained in one or even several lessons, so this lesson might pack a lot of concepts. The best recipe might simply be, at the end of the day, getting our hands dirty with the development and testing of a Spark application dealing with huge volumes of information; this goes beyond the scope of this course.
Let’s start with the basics.
Resource utilization
A common situation is to observe a Spark application intensively using most, if not all, of the resources of the cluster where it runs.
This can be a good thing; after all, we don’t want resources we might be paying for by the hour, the minute, or the second (especially in cloud environments) to sit idle, wasting precious processing time. On the other hand, if the volumes of data the application processes are not considerable, heavy resource usage can point to incorrect resource utilization. It also signals a concerning processing throughput, that is, for a given amount of time, the number of records effectively processed is low. Throughput means, in classical terms, the rate of items processed in a defined unit of time.
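The throughput definition above is a simple rate. The figures in this sketch are hypothetical, purely to make the arithmetic concrete:

```python
# Throughput = records processed per unit of time.
# Hypothetical numbers for a 10-minute batch run.
records_processed = 12_000_000
elapsed_seconds = 600

throughput = records_processed / elapsed_seconds
print(f"{throughput:,.0f} records/second")  # 20,000 records/second
```

Tracking this rate across runs (rather than a single absolute number) is what lets us tell whether a configuration change actually improved the application.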
So let’s see some recipes or recommendations that lead to increased ...