...

/

More Operations with DataFrames

More Operations with DataFrames

Get hands-on practice exploring various operations that can be performed on DataFrames.

We can also rename, drop or change the data type of DataFrame columns. Let’s see examples of these.

Changing column names

Our data has a rather awkward name for the column that represents movie rating: hitFlop. We can rename the column to the more appropriate name, “Rating,” using the withColumnRenamed method.

scala> val moviesNewColDF = movies.withColumnRenamed("hitFlop","Rating")
moviesNewColDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]

scala> moviesNewColDF.printSchema
root
 |-- imdbId: string (nullable = true)
 |-- title: string (nullable = true)
 |-- releaseYear: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- genre: string (nullable = true)
 |-- writers: string (nullable = true)
 |-- actors: string (nullable = true)
 |-- directors: string (nullable = true)
 |-- sequel: integer (nullable = true)
 |-- Rating: integer (nullable = true)

The original DataFrame movies isn’t changed, rather we add a new DataFrame that’s created with the changed column name.

Changing column types

In our original movies DataFrame, the column releaseDate is inferred as string type instead of date type if we don’t use the samplingRatio option. To fix this, we can create a new column from the releaseDate column and interpret it as a date type using the withColumn method.

scala> val newDF = movies.withColumn("launchDate", to_date($"releaseDate", "d MMM yyyy"))
                         .drop("releaseDate")
k: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]

scala>
...