Datasets
Get an introduction to the strongly typed Datasets API available in Spark.
Datasets
Below is the definition of a Dataset from the official Databricks documentation:
“A Dataset is a strongly-typed, immutable collection of objects that are mapped to a relational schema. Datasets are a type-safe structured API available in the statically typed, Spark-supported languages Java and Scala. Datasets are strictly a JVM language feature. Datasets aren’t supported in R and Python, since those languages are dynamically typed.”
Starting with Spark 2.0, the Dataset API became Spark’s primary structured API. A Dataset is strongly typed like an RDD, but it additionally benefits from richer optimizations under the hood, courtesy of Spark’s Catalyst optimizer.
In the context of Scala, we can think of a DataFrame as an alias for a collection of generic objects, Dataset[Row]. A Row is an untyped, generic JVM object that can hold fields of different types. In contrast, a Dataset is a collection of strongly typed JVM objects, defined by a case class in Scala or a class in Java. It is fair to say that every Dataset in Scala has an untyped view called a DataFrame, which is a Dataset of Row. The following table captures the notion of Datasets and DataFrames in ...
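The distinction between the untyped Dataset[Row] view and a strongly typed Dataset can be sketched in Scala as follows. This is a minimal, illustrative example: the Flight case class and the sample data are assumptions, not part of the original text.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}

// Local SparkSession for illustration only.
val spark = SparkSession.builder()
  .appName("DatasetVsDataFrame")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A hypothetical case class that supplies the Dataset's compile-time schema.
case class Flight(origin: String, dest: String, count: Long)

// A DataFrame is Dataset[Row]: columns are untyped at compile time.
val df: DataFrame = Seq(("SEA", "JFK", 45L), ("SFO", "ORD", 12L))
  .toDF("origin", "dest", "count")

// Converting to Dataset[Flight] binds each Row to the case class, so a
// field access like f.count is checked by the Scala compiler rather than
// failing at runtime as a misspelled column name would in a DataFrame.
val ds: Dataset[Flight] = df.as[Flight]
val longHauls = ds.filter(f => f.count > 20)

longHauls.show()
```

A typo such as f.cont would fail to compile here, whereas the equivalent DataFrame expression df.filter($"cont" > 20) would only fail when the job runs.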