The groupBy and groupByKey methods
Let’s learn about two major transformations that group data and may also trigger a data shuffle.
Grouping data
When a DataFrame column holds values representing an ID or key of some sort, many business scenarios call for grouping and processing the information based on it.
As we saw in the previous lesson, we have no control over how Spark initially allocates the rows among the partitions and the nodes where these reside. Still, we can use a grouping transformation to bring them together based on the key column.
To achieve this, the Spark API offers us the groupBy and groupByKey operations.
The groupBy method
Elements can be grouped together based on one or more fields acting as the grouping criteria, with the help of the groupBy(...) method provided by the DataFrame API.
However, just as when working with databases, grouped information is of little use without an operation performed on it. The kind of operation that Spark attaches to a groupBy is an aggregate function (as the sketch after this list shows), which can be, for example:
- count
- avg
- sum
- …and others
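To make this concrete, here is a minimal sketch of groupBy paired with a few aggregate functions. The sales data, the region and amount column names, and the local SparkSession setup are hypothetical illustrations, not part of the lesson:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, sum}

object GroupByExample extends App {
  val spark = SparkSession.builder()
    .appName("groupBy-example")
    .master("local[*]")
    .getOrCreate()

  import spark.implicits._

  // A hypothetical sales DataFrame: (region, amount).
  val sales = Seq(
    ("EMEA", 100.0),
    ("EMEA", 250.0),
    ("APAC", 300.0),
    ("APAC", 50.0),
    ("AMER", 400.0)
  ).toDF("region", "amount")

  // groupBy returns a RelationalGroupedDataset; applying an
  // aggregate function turns it back into a DataFrame.
  val totals = sales
    .groupBy($"region")
    .agg(
      count("*").as("orders"),
      sum($"amount").as("total"),
      avg($"amount").as("average")
    )

  totals.show()

  spark.stop()
}
```

Note that groupBy alone produces an intermediate grouped representation; only once an aggregate function is applied do we get a DataFrame we can act on again.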
The “Actions (II): Reduce and Aggregated Functions: Max, Min and Mean” lesson introduced aggregate functions. Still, the groupBy transformation paints the most precise picture of them, so let’s begin by depicting it.
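As a brief aside, reusing the hypothetical sales DataFrame from the sketch above: without groupBy, an aggregate function collapses the whole DataFrame into a single summary row, while with groupBy it produces one row per key:

```scala
import org.apache.spark.sql.functions.{avg, max, min}

// Whole-DataFrame aggregation: a single summary row.
sales.agg(max($"amount"), min($"amount"), avg($"amount")).show()

// Grouped aggregation: one summary row per distinct region.
sales.groupBy($"region").agg(max($"amount")).show()
```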
We’ll follow our tradition of illustrating transformations, but for this operation, we’ll change the layout to give a more comprehensive picture of the grouping flow: