Working with Datasets

A hands-on exercise in creating Datasets and performing various operations on them.

We’ll focus on creating and working with Datasets in the Spark Scala shell.

Creating Datasets

As with DataFrames, we can either specify the schema when creating Datasets or have Spark infer it. Inference, however, is expensive and sometimes error-prone, so ideally we should explicitly specify the data types of the fields that make up our data.
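For comparison, here is a minimal sketch of schema inference in the Spark shell. The file name movies.csv is a placeholder, and spark is the SparkSession the shell provides:

// Schema inference: Spark makes an extra pass over the data to guess each
// column's type, which is what makes this approach expensive on large files.
val inferred = spark.read
  .option("header", "true")      // treat the first line as column names
  .option("inferSchema", "true") // scan the data to infer column types
  .csv("movies.csv")             // placeholder path

inferred.printSchema()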

When creating Datasets in Scala, we can use a case class to define the schema. For instance, the data we have been working with has the following fields:

imdbId, title, releaseYear, releaseDate, genre, writers, actors, directors, sequel, hitFlop

We can create a corresponding Scala case class as follows:

case class BMovieDetail(imdbId: String,
                        title: String,
                        releaseYear: Int,
                        releaseDate: String,
                        genre: String,
                        writers: String,
                        actors: String,
                        directors: String,
                        sequel: String,
                        hitFlop: Int)

We can derive the schema from the case class defined above and use it to read the CSV file. The sequence of commands to do so is shown below:
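Here is a minimal sketch of that sequence, assuming the BMovieDetail case class above is in scope and that the data lives in a file named movies.csv (a placeholder):

import org.apache.spark.sql.Encoders

// Derive a StructType schema from the case class definition.
val schema = Encoders.product[BMovieDetail].schema

// Read the CSV with the explicit schema, then convert the untyped rows
// into strongly typed BMovieDetail objects. In the Spark shell,
// spark.implicits._ is already imported, which supplies the encoder
// that .as[BMovieDetail] needs.
val movies = spark.read
  .option("header", "true")
  .schema(schema)
  .csv("movies.csv")
  .as[BMovieDetail]

movies.show(3)

Because movies is now a Dataset[BMovieDetail], operations such as movies.filter(_.releaseYear > 2000) work on typed objects rather than untyped rows, and mistakes in field names are caught at compile time.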
