

Spark's Java Main Abstraction: The DataFrame

Get introduced to Spark's main abstraction in this lesson.

What is a DataFrame?

A DataFrame is both a logical container of data and an API, purposely built as a higher-level abstraction over RDDs, the older Spark abstraction (JavaRDDs in the case of the Java API).

In the Spark context, “logical container” means a placeholder for data that Spark loads and distributes across the worker nodes, which do the actual processing on a physical cluster.

The DataFrame provides a simple yet powerful API for distributed data processing. That is, it hides the complexity of running an application on a cluster, so developers do not have to write that difficult code themselves.
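The two ideas above, a logical container whose data is split into pieces and an API that hides how those pieces are processed, can be sketched in plain Java. This is only a toy model under stated assumptions: `ToyFrame`, its methods, and the round-robin split are hypothetical names and choices made for illustration. They are not Spark's DataFrame API, and local threads stand in for worker nodes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Toy stand-in for a "logical container". NOT Spark's DataFrame:
// every name here is hypothetical and exists only for illustration.
final class ToyFrame<T> {
    // Each inner list plays the role of one partition held by one worker.
    private final List<List<T>> partitions;

    private ToyFrame(List<List<T>> partitions) {
        this.partitions = partitions;
    }

    // Distribute the rows round-robin over numPartitions partitions.
    static <T> ToyFrame<T> of(List<T> rows, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be >= 1");
        }
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < rows.size(); i++) {
            parts.get(i % numPartitions).add(rows.get(i));
        }
        return new ToyFrame<>(parts);
    }

    // The caller states WHAT to keep; the container decides how the work
    // is spread over partitions (here: a parallel stream on local threads).
    ToyFrame<T> filter(Predicate<T> keep) {
        List<List<T>> kept = partitions.parallelStream()
                .map(p -> p.stream().filter(keep).collect(Collectors.toList()))
                .collect(Collectors.toList());
        return new ToyFrame<>(kept);
    }

    // Sum the partial counts computed per partition.
    long count() {
        return partitions.parallelStream().mapToLong(List::size).sum();
    }

    int numPartitions() {
        return partitions.size();
    }

    public static void main(String[] args) {
        ToyFrame<Integer> frame =
                ToyFrame.of(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 3);
        // The calling code never mentions partitions or threads.
        System.out.println(frame.filter(n -> n % 2 == 0).count());
    }
}
```

The calling code in `main` reads the same whether the container holds one partition or many, which is the kind of simplification the DataFrame API aims for at cluster scale.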

Like RDDs, but going one step further, DataFrames leverage the distributed processing that a big data processing model needs to handle huge amounts of information.

Some of its main features are:

  • The ability to scale from a small amount of data on a single local machine to petabytes on a cluster.

  • Support for a wide range of sources and formats for reading data.

  • Code execution optimization through the Spark SQL Catalyst Optimizer.

Note: The Catalyst Optimizer is too complex and lengthy a topic to cover in this course, but the following link provides more information: Databricks optimizer docs. ...