...

The groupBy and groupByKey methods

Let’s learn about two major transformations that group data and might also cause data shuffling.

Grouping data

When a DataFrame column contains values that represent an ID or key of some sort, many business scenarios call for grouping and processing the information based on it.

As we’ve seen in the previous lesson, we have no control over how Spark initially allocates the rows among the partitions and nodes where they reside. Still, we can use a grouping transformation to bring them closer together based on the key column.

To achieve this, the Spark API introduces us to the groupBy and groupByKey operations.

The groupBy method

Elements can be grouped based on one or more fields acting as the grouping criteria, with the help of the groupBy(...) method provided by the DataFrame API.

However, just as when working with databases, the grouped information is of little use without an operation performed on it. The type of operation that Spark links to a groupBy is an aggregate function (see the sketch after this list), which can be, for example:

  • count
  • avg
  • sum
  • … others
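
To make this concrete, here is a minimal, self-contained sketch of groupBy paired with these aggregate functions. The data, column names, and application name are assumptions made up for illustration, not part of the lesson’s project:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, count, sum}

object GroupBySketch {
  def main(args: Array[String]): Unit = {
    // A minimal local session; the app name and master are assumptions.
    val spark = SparkSession.builder()
      .appName("groupBy-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sales data: one row per order (storeId, amount).
    val sales = Seq(
      ("store-1", 10.0),
      ("store-2", 25.0),
      ("store-1", 5.0),
      ("store-2", 15.0)
    ).toDF("storeId", "amount")

    // groupBy alone returns a RelationalGroupedDataset; applying an
    // aggregate function turns it back into a DataFrame.
    sales.groupBy($"storeId")
      .agg(
        count($"amount").as("orders"),
        avg($"amount").as("avgAmount"),
        sum($"amount").as("totalAmount"))
      .show()

    spark.stop()
  }
}
```

Note that groupBy by itself is not a complete transformation: it yields a RelationalGroupedDataset, and only the call to agg (or a shortcut such as count) produces a DataFrame that Spark can execute.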

The “Actions (II): Reduce and Aggregated Functions: Max, Min and Mean” lesson introduced aggregate functions, but the groupBy transformation paints the clearest picture of them, so let’s begin by depicting it.

We’ll follow our tradition of illustrating transformations, but for this operation, we’ll change the layout to show a more comprehensive view of the grouping operation flow:

There are a couple of important points to highlight:

  1. The rows might be rearranged and grouped based on the IDs or keys used in groupBy, following a principle of proximity at the partition level. This means that rows sharing the same key can end up in the same partition, speeding up the calculation applied in a subsequent step (an aggregate function) if they are moved around.

  2. Since the numbers of rows matching each grouping key are rarely equal, additional partitions might be created, as depicted by partition “N-1” in the diagram above (see the sketch after this list).
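
One way to observe both points is to tag each row with Spark’s built-in spark_partition_id function before and after the shuffle. This sketch reuses the hypothetical sales DataFrame from the example above:

```scala
import org.apache.spark.sql.functions.spark_partition_id

// Where does each row live before the shuffle?
sales.withColumn("partition", spark_partition_id()).show()

// After groupBy + count, rows with the same key have been moved
// (shuffled) so that each key can be aggregated in one place.
val grouped = sales.groupBy($"storeId").count()
grouped.withColumn("partition", spark_partition_id()).show()
```

The number of partitions produced by the shuffle is governed by the spark.sql.shuffle.partitions setting (200 by default), which is why the partition count after grouping need not match the count before it.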

The following project exemplifies the grouping operation. We’ll omit here the common lines of code present in our Spark projects around bootstrapping the application, setting up the session, loading a file, and so forth. ...
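Since the project’s code isn’t reproduced here, the following is only a hypothetical stand-in condensing the omitted steps; the file path, format options, and grouping column are all assumptions:

```scala
// Hypothetical stand-in for the omitted project code
// (assumes the `spark` session from the earlier sketch).
val people = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("data/people.csv") // assumed path and schema

// Group by an assumed key column and count the rows per group.
people.groupBy($"country")
  .count()
  .show()
```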
