Sort and OrderBy

Let’s learn how Spark sorts data and how to sort data with the Spark API.

Sorting in Spark

In the Spark Java API, several sorting methods are available to order a DataFrame's records based on specified criteria. In this lesson, we introduce the two most common ones and go through some examples.

Sorting in Spark is internally a complex operation whose cost depends on factors such as:

  • The types of the columns used as sorting criteria: dates, numbers, strings, and so on.

  • The direction of the sort: ascending or descending.

  • Where the first and last values physically reside across the partitions.
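As a minimal sketch of how these factors surface in the API, the snippet below sorts a tiny, made-up attractions DataFrame (the names, the `year` column, and the `NULL` row are all invented for illustration). The `Column` ordering methods let us pick the direction and where `NULL` values land:

```java
import static org.apache.spark.sql.functions.col;

import java.util.List;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SortFactorsDemo {

    // Returns the rows of df ordered by the given column expression.
    public static List<Row> sortBy(Dataset<Row> df, Column order) {
        return df.sort(order).collectAsList();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SortFactorsDemo")
                .master("local[*]")   // local mode, for illustration only
                .getOrCreate();

        // Tiny hypothetical dataset; the NULL year shows how null placement matters
        Dataset<Row> df = spark.sql(
                "SELECT * FROM VALUES " +
                "('Cliffs of Moher', 1588), " +
                "('Blarney Castle', 1446), " +
                "('Mystery Stop', CAST(NULL AS INT)) " +
                "AS attractions(name, year)");

        // Ascending vs. descending, with explicit placement for NULLs
        df.sort(col("year").asc_nulls_first()).show();
        df.sort(col("year").desc_nulls_last()).show();

        spark.stop();
    }
}
```

Note that `asc_nulls_first()` and `desc_nulls_last()` make the treatment of missing values explicit instead of relying on the default placement.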

The sorting algorithm Spark uses internally falls outside the scope of this course. One thing we can be sure of, however: because the data may be spread across many nodes, Spark cannot simply hold every piece of it in a single location, so a shuffle is very likely to happen.

We can find the project for this lesson below:

mvn install exec:exec
Project with codebase for sorting examples in Spark

Sort

The first sorting operation comes in the shape of the aptly named sort(...) method. The method is overloaded with a few different signatures, but let's look at what they all have in common, regardless of the number of arguments required:

public Dataset<T> sort(final Column... sortExprs) { /* implementation ... */ }

We are required to pass the columns (or, with another overload, their names as strings) that the sorting action uses as criteria for ordering a DataFrame's rows.
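To see the overloads side by side, here is a sketch against an invented two-column DataFrame (the `county` and `name` columns stand in for the real attractions dataset). The String-based overload takes plain column names, while the Column-based one lets us attach an ordering per column:

```java
import static org.apache.spark.sql.functions.col;

import java.util.List;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SortOverloadsDemo {

    // String-based overload: sort(String sortCol, String... sortCols), ascending by default
    public static List<Row> byNames(Dataset<Row> df) {
        return df.sort("county", "name").collectAsList();
    }

    // Column-based overload: sort(Column... sortExprs), with a direction per column
    public static List<Row> byColumns(Dataset<Row> df) {
        return df.sort(col("county").asc(), col("name").desc()).collectAsList();
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SortOverloadsDemo")
                .master("local[*]")   // local mode, for illustration only
                .getOrCreate();

        // Hypothetical rows, standing in for the attractions dataset
        Dataset<Row> df = spark.sql(
                "SELECT * FROM VALUES " +
                "('Clare', 'Cliffs of Moher'), " +
                "('Cork', 'Blarney Castle'), " +
                "('Cork', 'Fota Wildlife Park') " +
                "AS attractions(county, name)");

        byNames(df).forEach(System.out::println);
        byColumns(df).forEach(System.out::println);

        spark.stop();
    }
}
```

The String overload is the more concise choice when every column sorts ascending; the Column overload is needed as soon as directions mix.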

And what are we sorting today? A DataFrame of attractions available to visit in Ireland.

The dataset holds roughly 1,700 records, a trivial amount for Spark to read and load into a DataFrame in the same fashion as we have been doing in all our lessons. Just out of ...