More Operations with DataFrames
Get hands-on practice exploring various operations that can be performed on DataFrames.
We can also rename, drop, or change the data type of DataFrame columns. Let's see examples of each.
Changing column names
Our data has a rather awkward name for the column that represents movie rating: hitFlop. We can rename the column to the more appropriate name "Rating" using the withColumnRenamed method.
scala> val moviesNewColDF = movies.withColumnRenamed("hitFlop","Rating")
moviesNewColDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]
scala> moviesNewColDF.printSchema
root
|-- imdbId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseYear: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- genre: string (nullable = true)
|-- writers: string (nullable = true)
|-- actors: string (nullable = true)
|-- directors: string (nullable = true)
|-- sequel: integer (nullable = true)
|-- Rating: integer (nullable = true)
The original DataFrame movies isn't changed; rather, a new DataFrame is created with the renamed column.
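DataFrames are immutable, so we can confirm this in the shell by inspecting the column lists of both DataFrames. A minimal sketch, assuming the movies and moviesNewColDF DataFrames defined above:

scala> movies.columns.contains("hitFlop")         // still true: the source DataFrame is untouched
scala> moviesNewColDF.columns.contains("Rating")  // true only for the new DataFrame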
Changing column types
In our original movies DataFrame, the releaseDate column is inferred as a string type instead of a date type if we don't use the samplingRatio option. To fix this, we can create a new column from the releaseDate column and interpret it as a date type using the withColumn method, and then drop the original string column.
scala> val withDateDF = movies.withColumn("launchDate", to_date($"releaseDate", "d MMM yyyy"))
withDateDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 9 more fields]
scala> val newDF = withDateDF.drop("releaseDate")
newDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]
scala> newDF.printSchema
root
|-- imdbId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseYear: string (nullable = true)
|-- genre: string (nullable = true)
|-- writers: string (nullable = true)
|-- actors: string (nullable = true)
|-- directors: string (nullable = true)
|-- sequel: integer (nullable = true)
|-- hitFlop: integer (nullable = true)
|-- launchDate: date (nullable = true)
Spark offers to-and-from methods for date and timestamp types. In the snippet above, we first parse releaseDate into a new launchDate column and then drop the original releaseDate column, keeping the intermediate withDateDF around so we can compare the raw and parsed values below. The to_date method takes the column we want to read from and the format of the date to parse, which here is d MMM yyyy (for example, "20 Apr 2010"). The date patterns available for formatting and parsing are listed in Spark's datetime pattern documentation.
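Going the other way, date_format() can turn a date or timestamp column back into a formatted string. As a minimal sketch (output omitted), using the newDF created above:

scala> newDF.select($"launchDate", date_format($"launchDate", "yyyy-MM-dd").as("launchDateStr")).show(3, false)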
There may have been failures for some rows when converting from string to date. We can check for those failures as follows:
scala> withDateDF.select("releaseDate","launchDate").where($"launchDate".isNull).show(5,false)
+-----------+----------+
|releaseDate|launchDate|
+-----------+----------+
|N/A |null |
|N/A |null |
|N/A |null |
|28 Feb,2002|null |
|N/A |null |
+-----------+----------+
only showing top 5 rows
We can see that the conversion failed for rows that either didn't have a valid value for releaseDate or weren't in the format we passed in. We can find the total number of failures as follows:
scala> withDateDF.select("releaseDate","launchDate")
.where($"launchDate".isNull)
.count()
res80: Long = 54
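Some of these failures, such as "28 Feb,2002", use a slightly different pattern rather than being missing data. One way to recover such rows is to try multiple formats and keep the first one that parses, for example with coalesce; the second pattern below, d MMM,yyyy, is an assumption based on the sample row above:

scala> val repairedDF = withDateDF.withColumn("launchDate",
         coalesce(to_date($"releaseDate", "d MMM yyyy"), to_date($"releaseDate", "d MMM,yyyy")))
scala> repairedDF.where($"launchDate".isNull).count()   // should now count only the rows with genuinely missing dates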
We can now use methods such as year(), month(), and dayofmonth() on the launchDate column. For instance, we can rewrite the query from the previous lesson that lists all the distinct release years in our data, but have it use the launchDate column instead of the releaseYear column. The query and its output are shown below:
scala> newDF.select(year($"launchDate"))
.distinct()
.orderBy(year($"launchDate"))
.show()
+----------------+
|year(launchDate)|
+----------------+
| null|
| 2001|
| 2002|
| 2003|
| 2004|
| 2005|
| 2006|
| 2007|
| 2008|
| 2009|
| 2010|
| 2011|
| 2012|
| 2013|
| 2014|
+----------------+
Note that the result also includes null as an entry, since the rows that were missing data or didn't have the correct date format returned null from the to_date method.
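If we don't want that null entry in the report, we can filter out the rows where the conversion failed before aggregating, for example:

scala> newDF.where($"launchDate".isNotNull)
        .select(year($"launchDate"))
        .distinct()
        .orderBy(year($"launchDate"))
        .show()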
Aggregations
A lot of data analysis questions require aggregations to be performed on the data. For example, consider the query to calculate the number of movies released per year. We can do so using the groupBy method to group the rows by releaseYear and then ask for a count of the rows in each group. The query is shown below:
scala> movies.select("releaseYear")
.groupBy("releaseYear")
.count()
.orderBy("releaseYear")
.show
+-----------+-----+
|releaseYear|count|
+-----------+-----+
| 2001| 62|
| 2002| 79|
| 2003| 95|
| 2004| 88|
| 2005| 106|
| 2006| 60|
| 2007| 66|
| 2008| 98|
| 2009| 91|
| 2010| 116|
| 2011| 112|
| 2012| 99|
| 2013| 102|
| 2014| 110|
+-----------+-----+
We also orderBy the results by releaseYear so that the output is more readable. Spark also offers methods such as max(), min(), avg(), and sum() that can be used for mathematical operations. Some examples that use these methods are as follows:
Finding the maximum value in the hitFlop column:
scala> movies.select(max($"hitFlop"))
.show
+------------+
|max(hitFlop)|
+------------+
| 9|
+------------+
Finding the minimum value in the hitFlop column:
scala> movies.select(min($"hitFlop"))
.show
+------------+
|min(hitFlop)|
+------------+
| 1|
+------------+
Finding the sum of all the values in the hitFlop column:
scala> movies.select(sum($"hitFlop"))
.show
+------------+
|sum(hitFlop)|
+------------+
| 2753|
+------------+
Finding the average rating across all the movies:
scala> movies.select(avg($"hitFlop"))
.show
+------------------+
| avg(hitFlop)|
+------------------+
|2.1440809968847354|
+------------------+
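Instead of issuing a separate query for each statistic, the same numbers can be computed in a single pass with agg(); a minimal sketch (output omitted):

scala> movies.agg(min($"hitFlop"), max($"hitFlop"), sum($"hitFlop"), avg($"hitFlop")).show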
Other methods for advanced analysis also exist, such as describe() and the statistics functions exposed through stat, including corr(), cov(), sampleBy(), approxQuantile(), and freqItems().
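As a quick illustration of two of these, describe() summarizes a numeric column and stat.approxQuantile() estimates quantiles; the relative error of 0.05 below is just an illustrative choice:

scala> movies.describe("hitFlop").show
scala> movies.stat.approxQuantile("hitFlop", Array(0.25, 0.5, 0.75), 0.05)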
A more interesting query would be to find the average rating for the movies released in each year. We'll need to group the rows by releaseYear and then find the average rating for the movies in each year.
scala> movies.select("releaseYear","hitFlop")
.groupBy("releaseYear")
.avg("hitFlop")
.orderBy("releaseYear")
.show
+-----------+------------------+
|releaseYear| avg(hitFlop)|
+-----------+------------------+
| 2001| 2.306451612903226|
| 2002|1.9620253164556962|
| 2003|2.0105263157894737|
| 2004|1.9545454545454546|
| 2005| 2.009433962264151|
| 2006|2.9833333333333334|
| 2007| 2.621212121212121|
| 2008| 2.13265306122449|
| 2009| 1.835164835164835|
| 2010|1.8620689655172413|
| 2011|2.0535714285714284|
| 2012| 2.393939393939394|
| 2013| 2.343137254901961|
| 2014| 2.081818181818182|
+-----------+------------------+
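If we want both the number of movies and the average rating per year in a single result, groupBy() can be combined with agg(); a sketch (output omitted):

scala> movies.groupBy("releaseYear")
         .agg(count("*").as("numMovies"), avg("hitFlop").as("avgRating"))
         .orderBy("releaseYear")
         .show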
Take vs collect
We'll end the discussion of the DataFrame API with two methods: take() and collect(). When we invoke collect() on a DataFrame, all the rows that make up the DataFrame are returned to the driver. This can be a time-consuming and memory-intensive operation, potentially resulting in out-of-memory (OOM) errors if the DataFrame is large enough. In such situations, if the intent is just to peek at a few records, it is better to use the take(n) method, which returns the first n Row objects of the DataFrame.
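A minimal sketch of the difference, assuming the movies DataFrame from above:

scala> val firstFive = movies.take(5)    // Array[Row] containing only the first 5 rows; cheap even on large data
scala> val allRows = movies.collect()    // Array[Row] containing every row; can exhaust driver memory on large data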