Columns
Get hands-on practice performing various operations on the columns of a DataFrame.
Spark allows us to manipulate individual DataFrame columns using relational or computational expressions. Conceptually, columns represent a type of field and are similar to columns in pandas, R DataFrames, or relational tables. Columns are represented by the type Column in Spark’s supported languages. Let’s see some examples of working with columns next.
Listing all columns
We’ll assume we have already read the file BollywoodMovieDetail.csv into the DataFrame variable movies, as shown in the previous lesson. We can list the columns as follows:
scala> movies.columns
res2: Array[String] = Array(imdbId, title, releaseYear, releaseDate, genre, writers, actors, directors, sequel, hitFlop)
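Because columns returns a plain Array[String], the usual Scala collection operations apply to it. The snippet below is a minimal sketch of that idea, not part of the original lesson: it assumes a local SparkSession and uses a tiny in-memory DataFrame named sample with made-up rows as a stand-in for movies. It can be pasted into spark-shell with :paste or run as a Scala script with Spark on the classpath.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local, single-threaded session is enough for this sketch.
val spark = SparkSession.builder().master("local[1]").appName("columns-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows; only the column names mirror the lesson's dataset.
val sample = Seq(
  ("Movie A", 2001, 2),
  ("Movie B", 2005, 6)
).toDF("title", "releaseYear", "hitFlop")

// columns is an ordinary Array[String], so collection methods work on it.
val names: Array[String] = sample.columns
val hasRating: Boolean = names.contains("hitFlop")
val releaseFields: Array[String] = names.filter(_.startsWith("release"))

println(names.mkString(", "))  // title, releaseYear, hitFlop

// printSchema() additionally shows the data type of every column.
sample.printSchema()
```

Checking names.contains(...) before selecting a column is one way to avoid an analysis error when a column might be missing.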
Accessing a single column
We can access a particular column of the DataFrame by using the col() method, which returns an object of type Column. In the snippet below, we access the column hitFlop and then display five values.
scala> val ratingCol = movies.col("hitFlop")
ratingCol: org.apache.spark.sql.Column = hitFlop
scala> movies.select(ratingCol).show(5)
+-------+
|hitFlop|
+-------+
|      2|
|      6|
|      1|
|      4|
|      1|
+-------+
only showing top 5 rows
We can also select a column by passing its name directly to select(), or by wrapping the name in expr():
scala> movies.select("hitFlop").show(5)
+-------+
|hitFlop|
+-------+
|      2|
|      6|
|      1|
|      4|
|      1|
+-------+
only showing top 5 rows
scala> movies.select(expr("hitFlop")).show(5)
+-------+
|hitFlop|
+-------+
|      2|
|      6|
|      1|
|      4|
|      1|
+-------+
only showing top 5 rows
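The approaches shown so far are interchangeable for a simple column reference. As a hedged sketch, the snippet below compares them side by side, together with two other common Scala spellings: calling the DataFrame itself with a column name, and the $"..." syntax that becomes available after importing spark.implicits._. The DataFrame sample and its rows are hypothetical stand-ins for movies, and hitFlop is assumed to be an integer column here.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, expr}

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[1]").appName("column-refs-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the movies DataFrame.
val sample = Seq(("Movie A", 2), ("Movie B", 6), ("Movie C", 1)).toDF("title", "hitFlop")

// Five ways to refer to the same column.
val refs: Seq[Column] = Seq(
  sample.col("hitFlop"), // method on the DataFrame, as in the lesson
  sample("hitFlop"),     // apply() shorthand for col()
  col("hitFlop"),        // free function, resolved when the query is analyzed
  $"hitFlop",            // string interpolator from spark.implicits._
  expr("hitFlop")        // SQL expression parsed into a Column
)

// Select with each reference and collect the values for comparison.
val results: Seq[Seq[Int]] =
  refs.map(c => sample.select(c).collect().map(_.getInt(0)).toSeq)

results.foreach(r => println(r.mkString(", ")))
```

References obtained from a specific DataFrame, such as sample.col("hitFlop"), are bound to that DataFrame, which can help disambiguate identically named columns after a join.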
Manipulating columns
We can create new columns by performing operations on existing ones. For example, we may want to know which movies have a rating above 5. The expression below does not filter out any rows; it computes a boolean value for every row.
scala> movies.select(expr("hitFlop > 5")).show(3)
+-------------+
|(hitFlop > 5)|
+-------------+
|        false|
|         true|
|        false|
+-------------+
only showing top 3 rows
We can achieve the same result using the col() method:
scala> movies.select(movies.col("hitFlop") > 5).show(3)
+-------------+
|(hitFlop > 5)|
+-------------+
|        false|
|         true|
|        false|
+-------------+
only showing top 3 rows
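The select calls above return one boolean per row. To keep only the rows that satisfy the condition, the same Column expression can be passed to filter(). The following is a minimal sketch under assumptions, not the lesson's own code: sample is a hypothetical in-memory stand-in for movies, and hitFlop is assumed to be an integer column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[1]").appName("filter-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the movies DataFrame.
val sample = Seq(("Movie A", 2), ("Movie B", 6), ("Movie C", 1)).toDF("title", "hitFlop")

// filter() keeps only the rows for which the boolean column is true.
val goodMovies = sample.filter(col("hitFlop") > 5).select("title", "hitFlop")

// Collect the surviving titles to the driver (fine for tiny data only).
val goodTitles: Seq[String] = goodMovies.collect().map(_.getString(0)).toSeq

goodMovies.show()
```

With the sample rows above, only "Movie B" has a rating greater than 5, so it is the only row left after filtering.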
Note that the output column is named after the boolean condition we specified. We can append the newly computed column to the existing columns using the withColumn() method, which returns a new DataFrame and leaves movies unchanged:
scala> movies.withColumn("Good Movies to Watch", expr("hitFlop > 5")).show(3)
The resulting DataFrame keeps all the original columns and gains an additional column named “Good Movies to Watch”, which holds the result of the boolean expression hitFlop > 5 for each row.
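As a small illustration of what withColumn() produces, the sketch below applies the same expression to a hypothetical in-memory DataFrame named sample (a stand-in for movies, with made-up rows and an integer hitFlop column). It also shows alias(), a method on Column that gives a computed column a readable name inside select(); the name isGood is chosen only for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[1]").appName("withcolumn-sketch").getOrCreate()
import spark.implicits._

// Hypothetical rows standing in for the movies DataFrame.
val sample = Seq(("Movie A", 2), ("Movie B", 6), ("Movie C", 1)).toDF("title", "hitFlop")

// withColumn() returns a new DataFrame with the extra column appended at the end.
val flagged = sample.withColumn("Good Movies to Watch", expr("hitFlop > 5"))
val flags: Seq[Boolean] = flagged.collect().map(_.getBoolean(2)).toSeq

// alias() renames a computed column instead of the default "(hitFlop > 5)".
val renamed = sample.select(expr("hitFlop > 5").alias("isGood"))

println(flagged.columns.mkString(", "))  // title, hitFlop, Good Movies to Watch
println(sample.columns.mkString(", "))   // title, hitFlop
flagged.show()
```

The second println shows that sample still has only its two original columns, since DataFrames are immutable.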