Datasets
Explore the core concepts of Spark Datasets, focusing on their creation, schema definitions, and the role of encoders. Understand how Datasets differ from DataFrames in supporting strongly typed JVM objects, and discover when using Datasets is beneficial for type safety and performance in big data applications.
Below is the definition of a Dataset from the official Databricks documentation:
“A Dataset is a strongly typed, immutable collection of objects that are mapped to a relational schema. Datasets are a type-safe structured API available in the statically typed, Spark-supported languages Java and Scala. Datasets are strictly a JVM language feature. Datasets aren’t supported in R and Python, since these languages are dynamically typed.”
After Spark 2.0, the Dataset superseded the RDD as the recommended abstraction; it is strongly typed like an RDD, but with richer optimizations under the hood.
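To make this concrete, here is a minimal sketch of creating a strongly typed Dataset in Scala. The `Person` case class, the sample values, and the local master setting are illustrative assumptions, not part of the text above:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative domain class (an assumption for this sketch); Spark derives
// an Encoder for it automatically via spark.implicits.
case class Person(name: String, age: Int)

object DatasetExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DatasetExample")
      .master("local[*]") // local mode, chosen only for demonstration
      .getOrCreate()

    import spark.implicits._ // brings the implicit Encoders into scope

    // A strongly typed Dataset[Person], not a collection of generic Rows.
    val people = Seq(Person("Alice", 29), Person("Bob", 35)).toDS()

    // The lambda operates on Person objects, so a typo such as p.agee
    // would fail at compile time rather than at runtime.
    people.filter(p => p.age > 30).show()

    spark.stop()
  }
}
```

Because the element type is known at compile time, Spark can serialize `Person` objects with dedicated encoders rather than generic Java serialization, which is part of where the performance benefit comes from.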
In Scala, we can think of a DataFrame as an alias for Dataset[Row], a collection of generic Row objects. A Row is untyped: it is a generic JVM object that can hold fields of different types. A Dataset, in contrast, is a collection of strongly typed JVM objects, defined by a case class in Scala or a class in Java. It is therefore fair to say that every Dataset in Scala has an untyped view called a DataFrame, which is a Dataset of Row, as the sketch below illustrates.
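The following sketch contrasts untyped Row access with a typed view obtained via `as[Person]`; the `Person` class and sample values are again illustrative assumptions:

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Illustrative class; its field names and types must match the DataFrame schema.
case class Person(name: String, age: Long)

object DataFrameVsDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("DataFrameVsDataset")
      .master("local[*]") // local mode, for demonstration only
      .getOrCreate()

    import spark.implicits._

    // A DataFrame is just Dataset[Row]: fields are looked up by name at runtime.
    val df: DataFrame = Seq(("Alice", 29L), ("Bob", 35L)).toDF("name", "age")
    val first: Row = df.first()
    println(first.getAs[Long]("age")) // a misspelled column name fails only at runtime

    // The same data viewed as a typed Dataset: the compiler checks field access.
    val ds: Dataset[Person] = df.as[Person]
    ds.map(p => p.name.toUpperCase).show()

    spark.stop()
  }
}
```

The following table captures the notion of Datasets and DataFrames ...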