Spark's Main Java Abstraction: The DataFrame
Get introduced to Spark's main abstraction in this lesson.
What is a DataFrame?
A DataFrame is both a logical container of data and an API. It was purposely built as a higher-level abstraction over RDDs (JavaRDDs in the case of the Java API), an older Spark abstraction.
In the Spark context, a “logical container” is a placeholder for data that Spark loads and distributes, while the worker nodes do the actual processing on a physical cluster.
The DataFrame provides a simple yet powerful API that simplifies distributed data processing: it hides the complexity of the cluster, sparing developers from writing the difficult code otherwise needed to execute applications across many machines.
Like RDDs, but going one step further, DataFrames leverage the distributed processing power that a big data processing model needs in order to handle huge amounts of information.
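To make this concrete, here is a minimal sketch of the DataFrame API in Java. In the Java API a DataFrame is represented as a `Dataset<Row>`. The file `people.json` and its `age` and `country` columns are hypothetical, used only for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameIntro {
    public static void main(String[] args) {
        // Build a local SparkSession; on a real cluster the master URL
        // would point at the cluster manager instead of "local[*]".
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameIntro")
                .master("local[*]")
                .getOrCreate();

        // Read a hypothetical JSON file. Spark loads and distributes the
        // data without any explicit parallelism code from us.
        Dataset<Row> people = spark.read().json("people.json");

        // High-level, declarative operations replace the hand-written
        // distributed code (e.g., custom map/reduce over JavaRDDs).
        people.filter(people.col("age").gt(21))
              .groupBy("country")
              .count()
              .show();

        spark.stop();
    }
}
```

Notice that nothing in this code says *how* the work is split across worker nodes; that is exactly the complexity the DataFrame API hides.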
Some of its main features are:
- The ability to scale from a few kilobytes on a local or single machine to petabytes on a cluster.
- Support for a wide range of sources and formats for reading data (see the reader sketch after this list).
- Code execution optimization through the Spark SQL Catalyst Optimizer.
Note: The Catalyst Optimizer is a complex and lengthy topic that this course cannot cover in depth, but the Databricks optimizer docs provide more information.
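As a rough illustration of the last two features, the sketch below reads DataFrames from several formats using Spark's built-in reader and then calls `explain(true)` to print the logical and physical plans that the Catalyst Optimizer produces. The file paths and the `amount` column are placeholders:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataFrameSources {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("DataFrameSources")
                .master("local[*]")
                .getOrCreate();

        // The same reader API handles many formats and sources.
        Dataset<Row> fromCsv = spark.read()
                .option("header", "true")  // treat the first line as column names
                .csv("sales.csv");
        Dataset<Row> fromJson = spark.read().json("events.json");
        Dataset<Row> fromParquet = spark.read().parquet("warehouse.parquet");

        // explain(true) prints the parsed, analyzed, and optimized logical
        // plans plus the physical plan chosen by Catalyst.
        fromCsv.filter(fromCsv.col("amount").gt(100)).explain(true);

        spark.stop();
    }
}
```

Running `explain(true)` on any DataFrame is a quick way to see Catalyst at work without diving into its internals.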