Dataset: a DataFrame of POJOs
Learn about the Dataset abstraction and its relation to a DataFrame and the map() function.
What is a dataset?
In previous lessons, we showed code snippets where the following was referred to as a DataFrame:
Dataset<Row> df = ...
By convention in the Spark world, a Dataset of rows (Dataset<Row>) is referred to as a DataFrame, while Dataset objects typed to any other class, including Plain Old Java Objects (POJOs), are simply called Datasets.
The name isn’t the only difference. A DataFrame in Spark, or “dataset of rows”, comes with a richer API out of the box.
We’ve already used some methods from that API to manipulate the schema, and that’s just the tip of the iceberg.
Benefits of using a dataset
The main benefit of using a Dataset is the possibility of typing it to a POJO or an object from our business domain. In programmatic terms, it means using the following syntax:
Dataset<MyClass>
In turn, this means we’re not limited to working with DataFrames of Spark types (Integer, String, Binary, Date, etc.). Instead, we can map information to a collection of objects from our application’s domain.
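As a minimal sketch of what such a domain class might look like, the hypothetical Car POJO below follows the JavaBean conventions (a public no-argument constructor plus getters and setters) that Spark’s bean encoder, Encoders.bean, relies on. The fields make, model, and year are assumptions chosen for illustration; the project’s actual class may differ.

```java
import java.io.Serializable;

// Hypothetical domain class; the fields (make, model, year) are assumptions for illustration.
public class Car implements Serializable {
    private String make;
    private String model;
    private int year;

    // A bean-style encoder needs a public no-argument constructor.
    public Car() {}

    // Convenience constructor for building instances by hand.
    public Car(String make, String model, int year) {
        this.make = make;
        this.model = model;
        this.year = year;
    }

    // Getter/setter pairs define the bean properties.
    public String getMake() { return make; }
    public void setMake(String make) { this.make = make; }

    public String getModel() { return model; }
    public void setModel(String model) { this.model = model; }

    public int getYear() { return year; }
    public void setYear(int year) { this.year = year; }

    @Override
    public String toString() {
        return "Car{make=" + make + ", model=" + model + ", year=" + year + "}";
    }
}
```

With a class shaped like this, Dataset<MyClass> becomes Dataset<Car>.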
One limitation, though, is that the full DataFrame API (and the methods it exposes) won’t be available for Datasets typed to our POJOs. However, there are some workarounds to mitigate this, such as custom mapping and conversions between the two Spark abstractions.
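The custom mapping and conversions mentioned above could look roughly like the following sketch. It assumes Spark SQL is on the classpath and uses a deliberately small, hypothetical Car bean; the class name DatasetConversionSketch, the helper methods, the column names, and the sample values are all made up for illustration. Here, map() with a MapFunction turns each Row into a Car, and toDF() converts the typed Dataset back into a DataFrame.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class DatasetConversionSketch {

    // Hypothetical POJO; the fields are assumptions for illustration.
    public static class Car implements Serializable {
        private String make;
        private int year;

        public Car() {}

        public String getMake() { return make; }
        public void setMake(String make) { this.make = make; }
        public int getYear() { return year; }
        public void setYear(int year) { this.year = year; }
    }

    // Builds a tiny DataFrame (Dataset<Row>) with the columns "make" and "year".
    public static Dataset<Row> sampleDataFrame(SparkSession spark) {
        StructType schema = new StructType()
                .add("make", DataTypes.StringType)
                .add("year", DataTypes.IntegerType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("Toyota", 2020),
                RowFactory.create("Ford", 2018));
        return spark.createDataFrame(rows, schema);
    }

    // Custom mapping: turn every Row into a Car with map().
    // The cast to MapFunction selects the Java-friendly overload of map().
    public static Dataset<Car> toCars(Dataset<Row> df) {
        return df.map((MapFunction<Row, Car>) row -> {
            Car car = new Car();
            car.setMake(row.<String>getAs("make"));
            car.setYear(row.<Integer>getAs("year"));
            return car;
        }, Encoders.bean(Car.class));
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("dataset-conversion-sketch")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> df = sampleDataFrame(spark);  // DataFrame
        Dataset<Car> cars = toCars(df);            // Dataset typed to the POJO
        Dataset<Row> backToDf = cars.toDF();       // back to a DataFrame

        cars.show();
        backToDf.printSchema();
        spark.stop();
    }
}
```

When the column names and types already line up with the bean’s properties, df.as(Encoders.bean(Car.class)) is a shorter alternative to the explicit map() call. The explicit mapping is mainly useful when columns need renaming or transforming along the way.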
The code example
It’s time to play around with Datasets. As usual, it helps to diagram what this project involving Datasets does.
Outline of the project’s flow
The diagram below shows the DataFrame as a Dataset of Row type and its conversion to a Dataset of Car type (the example POJO used in the project).