Data Partitioning and Shuffling
Learn about two important concepts every Spark developer should be familiar with: partitioning and shuffling.
The term “big data” refers to extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations. Ordinary data sets can generally fit in the memory of a running JVM application (provided we haven’t left a nasty footprint of memory leaks) until the data is written to persistent storage, such as a database.
When we’re asked to process volumes of information by the millions or billions, however, traditional strategies and systems begin to fall short of the task at hand.
Luckily for us, Spark comes to our aid and allows us to process these humongous volumes. However, one question remains: how does Spark fit massive volumes of information into its nodes?
This lesson intends to provide some insight into this question and to touch on some important related concepts.
Note: Spark does have its limitations. Like any technology, it is no silver bullet, but it specializes in dealing with big data. We will learn about Spark performance and tuning techniques in an upcoming lesson, which will allow us to use Spark more efficiently with the resources we have at hand.
Data partitioning
In the first chapters, we learned about partitions. If you need a refresher, feel free to review those earlier lessons before proceeding with this one.
Data partitioning is the mechanism by which Spark divides the to-be-processed data into partitions and distributes them across multiple cluster nodes. This mechanism is crucial because it affects both performance and the use of available resources.
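To see this mechanism in action, here is a minimal Scala sketch (the application name, local master, and partition count of 8 are illustrative choices, not requirements) that distributes a data set across partitions and inspects how the elements were divided:

```scala
import org.apache.spark.sql.SparkSession

object PartitioningDemo {
  def main(args: Array[String]): Unit = {
    // A local session with 4 cores; on a real cluster the master URL differs.
    val spark = SparkSession.builder()
      .appName("partitioning-demo")
      .master("local[4]")
      .getOrCreate()

    // Distribute one million integers across 8 partitions.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)

    // Each partition can be processed independently, potentially on a different node.
    println(s"Number of partitions: ${rdd.getNumPartitions}") // prints 8

    // Count how many elements landed in each partition.
    rdd.mapPartitionsWithIndex { (index, iter) =>
      Iterator((index, iter.size))
    }.collect().foreach { case (index, count) =>
      println(s"Partition $index holds $count elements")
    }

    spark.stop()
  }
}
```

Because the work on each partition runs in parallel, choosing how many partitions to use (and how evenly data spreads across them) directly influences how well the cluster's resources are utilized.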
Why is data partitioning crucial? Let’s go through a scenario to illustrate this.
Consider a company that processes sales, with an ID recorded for each transaction that occurs. Some of the standard reports it generates rely on grouping sales information by seller ID. Sales volumes can run into the millions.
As Spark developers and big data users, we’ll load the sales information from a file. Behind the scenes, Spark will partition the data. This process is illustrated in the diagram below.
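As a sketch of what this scenario might look like in code, here is a hedged Scala example; the file path `data/sales.csv` and the column names `seller_id` and `amount` are hypothetical and stand in for whatever schema the company's sales file actually uses:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object SalesReport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sales-report")
      .master("local[*]")
      .getOrCreate()

    // Hypothetical CSV with columns: transaction_id, seller_id, amount.
    val sales = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("data/sales.csv") // illustrative path

    // Spark has already split the file into partitions while reading it.
    println(s"Partitions after load: ${sales.rdd.getNumPartitions}")

    // Grouping by seller_id forces rows with the same key onto the same
    // partition. Moving rows between partitions like this is a shuffle,
    // which we explore later in this lesson.
    val totals = sales
      .groupBy("seller_id")
      .agg(sum("amount").as("total_sales"))

    totals.show()
    spark.stop()
  }
}
```

Note that the grouping step is where partitioning and shuffling meet: to total the sales per seller, Spark must bring all of a seller's rows together, regardless of which partition they originally landed in.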