Working with Datasets
Hands-on exercise to learn about creating Datasets and performing various operations on them.
We’ll focus on creating and working with Datasets in the Scala shell for Spark.
Creating Datasets
As is the case with DataFrames, we need to specify the schema when creating Datasets. We could instead rely on Spark to infer the schema but, as with DataFrames, inference is an expensive and sometimes error-prone approach. Ideally, we should specify the data types of the fields that make up our data ourselves.
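For contrast, here is a minimal sketch of the inference-based approach in the Spark shell (the file name is a placeholder for our movies dataset). Spark makes an extra pass over the file just to guess each column's type:

// Schema inference: Spark scans the data to guess column types,
// which is slow on large files and can infer the wrong type.
val inferred = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("BollywoodMovieDetail.csv") // placeholder file name

inferred.printSchema()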
When creating Datasets in Scala, we can use a case class to define the schema. For instance, the data we have been looking at has the following fields:
imdbId | title | releaseYear | releaseDate | genre | writers | actors | directors | sequel | hitFlop
We can create a corresponding Scala case class as follows (note that the hitFlop column is represented by the rating field):
case class BMovieDetail(imdbID: String,
                        title: String,
                        releaseYear: Int,
                        releaseDate: String,
                        genre: String,
                        writers: String,
                        actors: String,
                        directors: String,
                        sequel: String,
                        rating: Int)
We can get the schema of the class created above and use it to read the CSV file. The sequence of commands to do so is shown below.
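A minimal sketch of those commands follows; the file name is a placeholder, and we assume the case class above has already been defined in the shell. We derive the schema from the case class through its encoder, then convert the resulting DataFrame into a Dataset with as[BMovieDetail]:

import org.apache.spark.sql.Encoders

// Derive the schema from the case class via its encoder
val schema = Encoders.product[BMovieDetail].schema

// Read the CSV with the explicit schema, then convert the
// DataFrame to a strongly typed Dataset[BMovieDetail].
// The Spark shell imports spark.implicits._ for us, which
// supplies the encoder that as[...] needs.
val movies = spark.read
  .format("csv")
  .option("header", "true")
  .schema(schema)
  .load("BollywoodMovieDetail.csv") // placeholder file name
  .as[BMovieDetail]

movies.show(5)

With the Dataset in hand, operations such as filter and map receive typed BMovieDetail objects rather than generic Rows.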