Working with Datasets
Hands-on exercise to learn about creating Datasets and performing various operations on them.
We’ll focus on creating and working with Datasets in the Scala shell for Spark.
Creating Datasets
As is the case with DataFrames, we need to specify the schema when creating Datasets. We could instead rely on Spark to infer the schema but, as with DataFrames, inference is an expensive and sometimes error-prone approach. Ideally, we should specify the data types of the fields that make up our data ourselves.
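For contrast, here is a minimal sketch of the inference-based approach in the Spark shell (the file name is a placeholder for our movies dataset). Spark makes an extra pass over the file just to guess each column's type:

// Schema inference: Spark scans the data to guess column types,
// which is slow on large files and can infer the wrong type.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("BollywoodMovieDetail.csv") // placeholder file name

inferred.printSchema()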
When creating Datasets in Scala, we can use a case class to define the schema. For instance, the data we have been looking at has the following fields:
imdbId | title | releaseYear | releaseDate | genre | writers | actors | directors | sequel | hitFlop
We can create a corresponding Scala case class as follows (note that the hitFlop column is represented by the rating field):
case class BMovieDetail(imdbID: String,
                        title: String,
                        releaseYear: Int,
                        releaseDate: String,
                        genre: String,
                        writers: String,
                        actors: String,
                        directors: String,
                        sequel: String,
                        rating: Int)
We can get the schema of the class created above and use it to read the CSV file. The sequence of commands to do so is shown below.
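A minimal sketch of those commands follows; the file name is a placeholder, and we assume the case class above has already been defined in the shell. We derive the schema from the case class through its encoder, then convert the resulting DataFrame into a Dataset with as[BMovieDetail]:

import org.apache.spark.sql.Encoders

// Derive the schema from the case class via its encoder
val schema = Encoders.product[BMovieDetail].schema

// Read the CSV with the explicit schema, then convert the
// DataFrame to a strongly typed Dataset[BMovieDetail].
// The Spark shell imports spark.implicits._ for us, which
// supplies the encoder that as[...] needs.
val movies = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load("BollywoodMovieDetail.csv") // placeholder file name
  .as[BMovieDetail]

movies.show(5)

With the Dataset in hand, operations such as filter and map receive typed BMovieDetail objects rather than generic Rows.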