Datasets

Get an introduction to the strongly typed Datasets API available in Spark.


Below is the definition of a Dataset from the official Databricks documentation:

“A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. Datasets are a type-safe structured API available in the statically typed, Spark-supported languages Java and Scala. Datasets are strictly a JVM language feature. Datasets aren’t supported in R and Python since these are dynamically typed languages.”

Since Spark 2.0, the Dataset has taken over from the RDD as the primary structured API. A Dataset is strongly typed like an RDD, but benefits from richer optimizations under the hood via the Catalyst optimizer and the Tungsten execution engine.
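A minimal sketch of creating a strongly typed Dataset from a case class. The case class `Taxi`, the app name, and the sample data are illustrative assumptions, not part of any real dataset; the Spark APIs used (`SparkSession.builder`, `toDS`, `filter`, `show`) are standard.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical domain class; any case class with encodable fields works.
case class Taxi(tripId: Long, distance: Double)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dataset-example")
      .master("local[*]")          // local mode for the sketch
      .getOrCreate()
    import spark.implicits._       // brings encoders for case classes into scope

    // Build a Dataset[Taxi] from an in-memory collection.
    val trips = Seq(Taxi(1L, 2.5), Taxi(2L, 7.1)).toDS()

    // Field access is checked at compile time: trip.distance is a Double,
    // and a misspelled field name would fail compilation, not the job.
    val longTrips = trips.filter(trip => trip.distance > 3.0)
    longTrips.show()

    spark.stop()
  }
}
```

Because the lambda operates on `Taxi` objects rather than generic rows, type errors surface at compile time while Catalyst still optimizes the query plan.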

Typed vs Untyped APIs

In the context of Scala, we can think of a DataFrame as an alias for a collection of generic objects, represented as Dataset[Row]. A Row is an untyped, generic JVM object that can hold fields of different types. In contrast, a Dataset is a collection of strongly typed JVM objects: a case class in Scala or a class in Java. It is fair to say that each Dataset in Scala has an untyped view called a DataFrame, which is a Dataset of Row. The following table captures the notion of Datasets and DataFrames in ...
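The typed/untyped contrast can be sketched as follows. The case class `Person`, the app name, and the sample data are assumptions for illustration; `DataFrame`, `Dataset`, `Row`, `Row.getAs`, and `as[T]` are standard Spark SQL APIs.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical schema used for the typed view.
case class Person(name: String, age: Long)

object TypedVsUntyped {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("typed-vs-untyped")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Untyped: a DataFrame is just Dataset[Row]. Fields are looked up
    // by name at runtime, so a misspelled column only fails when the job runs.
    val df: DataFrame = Seq(("Ann", 34L), ("Bob", 29L)).toDF("name", "age")
    val firstRow: Row = df.first()
    val ageAtRuntime: Long = firstRow.getAs[Long]("age")

    // Typed: as[Person] binds the schema to a case class, so person.age
    // is checked by the Scala compiler before the job ever runs.
    val people: Dataset[Person] = df.as[Person]
    val adults = people.filter(person => person.age >= 30)
    adults.show()

    spark.stop()
  }
}
```

The conversion with `as[Person]` is what gives each Dataset its untyped DataFrame view and vice versa: `people.toDF()` recovers the Dataset[Row] form without copying data.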