Spark's Main Java Abstraction: The DataFrame

What is a DataFrame?

A DataFrame is both a logical container of data and an API, purposely built as a higher-level abstraction over RDDs, an older Spark abstraction (JavaRDDs, in the case of the Java API).

In the Spark context, a “logical container” is a placeholder for data that Spark loads and distributes, while worker nodes do the actual processing on a physical cluster.

The DataFrame provides a simple yet powerful API that simplifies distributed data processing. That is, it hides the complexity and spares developers from writing the difficult code needed to execute applications on a cluster.

Just like RDDs, but going one step further, DataFrames leverage the distributed processing power that a big data processing model needs to handle huge amounts of information.

Some of its main features are:

  • The ability to scale from a small amount of data on a local or single machine to petabytes on a cluster.

  • Support for a wide range of sources and formats for reading data.

  • Code execution optimization through the Spark SQL Catalyst Optimizer.

Note: The Catalyst Optimizer is too complex and lengthy a topic to cover in this course, but the following link can provide more information: Databricks optimizer docs.

Basics of the DataFrame

As previously described, DataFrames can be thought of as distributed collections of data. The basic structure of a DataFrame organizes this data logically into columns and rows, where each column has a name and a data type associated with it.
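As a minimal sketch of this structure, the snippet below (using an assumed local SparkSession and made-up example data) builds a small DataFrame where each column has a name and an associated data type:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import java.util.Arrays;
import java.util.List;

public class DataFrameBasics {
  public static void main(String[] args) {
    // Local session, used here only for illustration.
    SparkSession spark = SparkSession.builder()
        .appName("DataFrameBasics")
        .master("local[*]")
        .getOrCreate();

    // Each column has a name and a data type.
    StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.IntegerType);

    // Rows hold the actual values, matching the schema above.
    List<Row> rows = Arrays.asList(
        RowFactory.create("Alice", 34),
        RowFactory.create("Bob", 45));

    Dataset<Row> df = spark.createDataFrame(rows, schema);

    df.printSchema(); // column names and types
    df.show();        // the rows themselves

    spark.stop();
  }
}
```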

A foundational characteristic of the DataFrame API is that there are multiple implementations for different programming languages such as Scala, Java, or Python.

Another characteristic that provides flexibility is the ability to slice, split, or filter the data contained within a DataFrame, at both the column and row level.
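For example, column- and row-level slicing might look like the following sketch, assuming the df DataFrame from the previous example:

```java
import static org.apache.spark.sql.functions.col;

// Column-level slicing: keep only the "name" column.
Dataset<Row> namesOnly = df.select("name");

// Row-level filtering: keep only rows where "age" is greater than 40.
Dataset<Row> olderThan40 = df.filter(col("age").gt(40));

olderThan40.show();
```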

Note: More features are explained in individual lessons as the course develops.

The DataFrame API has also expanded over time with the introduction of the dataset construct, so let’s take a brief look at it.

Dataset: A Java Spark DataFrame implementation

For Java developers, the dataset is yet another available abstraction, built on top of the DataFrame. It was created to provide an interface more familiar to a strongly typed language like Java, while naturally retaining all the traits listed previously.

Conceptually, a dataset is a DataFrame typed into a Java object. (If this seems confusing, don’t worry; this course contains an entire lesson dedicated to it.) This is no small difference, because it allows developers to catch errors at compile time, just as is the case with Java Generics and Collections.
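A minimal sketch of this idea, assuming a hypothetical Person bean and the df DataFrame from the earlier examples, could look like this:

```java
import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// A plain Java bean representing one row of the business domain (hypothetical).
public class Person implements java.io.Serializable {
  private String name;
  private int age;

  public String getName() { return name; }
  public void setName(String name) { this.name = name; }
  public int getAge() { return age; }
  public void setAge(int age) { this.age = age; }
}

// Turn the untyped DataFrame (a Dataset<Row>) into a typed Dataset<Person>.
Dataset<Person> people = df.as(Encoders.bean(Person.class));

// Field access now goes through the bean's getters, so a typo such as
// p.getAeg() is caught at compile time rather than at runtime.
Dataset<Person> adults = people.filter((FilterFunction<Person>) p -> p.getAge() >= 18);
adults.show();
```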

The dataset exposes a very familiar type-safe and object-oriented interface for Java developers to work with, in conjunction with the classes of an application’s business domain. Still, it might also impose limitations on how Spark optimizes the operations applied to it behind the scenes.

Said limitation occurs, among other more complex reasons, simply because a dataset, which may hold any type of object, falls outside the set of structures the engine is familiar with when running pre-execution optimizations.

Ultimately, as with every technology, there are tradeoffs between using the more domain-focused datasets of the Spark API and relying on the out-of-the-box structures provided by the same API, such as DataFrames, Row objects, etc.

These two options are not mutually exclusive, though. All in all, Spark brings enough flexibility to let developers choose one or the other or a mix of both data structures while implementing a solution based on this technology.
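For instance, it is straightforward to move between the two representations. The following sketch (reusing the hypothetical Person dataset from above) converts a typed dataset back into an untyped DataFrame and vice versa:

```java
// From a typed Dataset<Person> back to an untyped DataFrame (Dataset<Row>).
Dataset<Row> untyped = people.toDF();

// From a DataFrame back to a typed dataset, as shown earlier.
Dataset<Person> typedAgain = untyped.as(Encoders.bean(Person.class));
```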

A comparison with the JDBC API

Analogies are sometimes quite helpful in understanding new topics. So, is there an existing library or technology to help or serve as an introduction to the concept of a DataFrame?

If you are familiar with the JDBC API, or Java JDBC library, the DataFrame can be seen as analogous to the ResultSet abstraction. Some reasons are as follows (a short comparison sketch follows the list):

  • A ResultSet is an object that contains a set of results from a database operation, and it can be iterated or manipulated by making calls to its API methods. In the case of the DataFrame, this structure also contains information similar to what we expect as a result of a DB operation (a succession of rows), and each row can be fetched.
  • The underlying schema can be accessed and metadata about the structure where the information is represented can be retrieved both in JDBC and the DataFrame APIs.
  • Specific columns of a row can be retrieved to obtain specific values, instead of operating with just the whole row.
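The sketch below illustrates the analogy side by side, assuming an existing JDBC Statement named statement, a hypothetical people table, and the df DataFrame from the earlier examples:

```java
import java.sql.ResultSet;
import java.sql.SQLException;

// JDBC: iterate a ResultSet and read a column by name.
try (ResultSet rs = statement.executeQuery("SELECT name, age FROM people")) {
  while (rs.next()) {
    String name = rs.getString("name");
  }
} catch (SQLException e) {
  e.printStackTrace();
}

// Spark: fetch the DataFrame's rows and read a column by name.
for (Row row : df.collectAsList()) {
  String name = row.getAs("name");
}
```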

Differences with JDBC’s ResultSet

There are, however, some significant differences to take into account:

  • Data can be nested, so a DataFrame might hold nested information for certain row columns. This can be the case when working with an XML or JSON document as the source of the DataFrame, for example. Future practical lessons in the course show some examples of this.

  • DataFrames are immutable, which means that we cannot update or delete rows. Instead, we operate on new DataFrames created from others.

Note: As we progress through the lessons, we’ll see that immutability is a hallmark of the abstractions in the Spark API.

  • Related to the previous point, columns can be added or removed, thus producing a new DataFrame, so the structure of the DataFrame and how the data is represented is dynamic. This allows a developer to enrich its schema or reorganize the information (see the sketch after this list).
  • Finally, another crucial difference is that there are no primary keys or indices on the DataFrame, even though some values can be treated as such in certain operations, because optimization is mostly managed by Spark.
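As a sketch of these last two points, the snippet below (again assuming the df DataFrame from the earlier examples) never modifies df; each operation produces a brand-new DataFrame:

```java
import static org.apache.spark.sql.functions.col;

// Adding a derived column yields a new DataFrame; df itself is untouched.
Dataset<Row> withMonths = df.withColumn("ageInMonths", col("age").multiply(12));

// Removing a column also yields a new DataFrame with a different schema.
Dataset<Row> withoutAge = withMonths.drop("age");

withoutAge.printSchema();
```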

DataFrame: A visual explanation

We can use a diagram yet again to aid our understanding of this important abstraction.
