More Operations with DataFrames
Get hands-on practice exploring various operations that can be performed on DataFrames.
We'll cover the following...
We can also rename, drop or change the data type of DataFrame columns. Let’s see examples of these.
Changing column names
Our data has a rather awkward name for the column that represents movie rating: hitFlop
. We can rename the column to the more appropriate name, “Rating,” using the withColumnRenamed
method.
scala> val moviesNewColDF = movies.withColumnRenamed("hitFlop","Rating")
moviesNewColDF: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]
scala> moviesNewColDF.printSchema
root
|-- imdbId: string (nullable = true)
|-- title: string (nullable = true)
|-- releaseYear: string (nullable = true)
|-- releaseDate: string (nullable = true)
|-- genre: string (nullable = true)
|-- writers: string (nullable = true)
|-- actors: string (nullable = true)
|-- directors: string (nullable = true)
|-- sequel: integer (nullable = true)
|-- Rating: integer (nullable = true)
The original DataFrame movies
isn’t changed, rather we add a new DataFrame that’s created with the changed column name.
Changing column types
In our original movies
DataFrame, the column releaseDate
is inferred as string type instead of date type if we don’t use the samplingRatio
option. To fix this, we can create a new column from the releaseDate
column and interpret it as a date type using the withColumn
method.
scala> val newDF = movies.withColumn("launchDate", to_date($"releaseDate", "d MMM yyyy"))
.drop("releaseDate")
k: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]
scala>
...