

Working with Datasets

Hands-on exercise to learn about creating Datasets and performing various operations on them.

We’ll focus on creating and working with Datasets in the Scala shell for Spark.

Creating Datasets

As with DataFrames, we should supply a schema when creating Datasets. Spark can infer the schema instead, but, as with DataFrames, inference is expensive and sometimes error-prone. Ideally, we specify the data type of every field in our data.
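To make the trade-off concrete, here is a minimal sketch for the Spark Scala shell that parses the same CSV rows twice, once with inference and once with an explicit schema. The two sample rows, the three-column subset, and the column types are assumptions made for illustration; they are not taken from the lesson's data.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._ // already in scope in spark-shell; harmless to repeat

// Two hypothetical CSV rows, held in memory so the sketch needs no file.
val lines = Seq(
  "tt0000001,Example Title,1999",
  "tt0000002,Another Title,2005"
).toDS()

// Option 1: let Spark infer the column types. This costs an extra pass
// over the data, and the columns get generic names (_c0, _c1, _c2).
val inferred = spark.read.option("inferSchema", "true").csv(lines)
inferred.printSchema()

// Option 2: declare names and types up front, so no inference pass is needed.
// The types below are assumed for this sketch.
val schema = StructType(Seq(
  StructField("imdbId", StringType),
  StructField("title", StringType),
  StructField("releaseYear", IntegerType)
))
val explicit = spark.read.schema(schema).csv(lines)
explicit.printSchema()
```

With an explicit schema, the column names and types are fixed by us rather than guessed from the rows that happen to be present.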

When instantiating Datasets in Scala, we can use a case class to define the schema. For instance, the data we have been looking at has the following fields:

imdbId, title, releaseYear, releaseDate, genre, writers, actors, directors
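A case class for these fields might look like the sketch below, written for the Spark Scala shell. The field types are assumptions (only the field names are given here), and the sample row is made up purely to show a Dataset being created.

```scala
import spark.implicits._ // already in scope in spark-shell; harmless to repeat

// One field per column. The types are assumed for this sketch.
case class Movie(
  imdbId: String,
  title: String,
  releaseYear: Int,
  releaseDate: String,
  genre: String,
  writers: String,
  actors: String,
  directors: String
)

// A made-up row, used only to demonstrate creating a Dataset[Movie].
val movies = Seq(
  Movie("tt0000001", "Example Title", 1999, "1999-01-01", "Drama",
        "Writer A", "Actor A", "Director A")
).toDS()

// The schema is derived from the case class fields.
movies.printSchema()
```

The same case class can also be applied to a DataFrame with `.as[Movie]`, provided the DataFrame's column names and types match the case class fields.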