Spark SQL - An Example
Work through an example that demonstrates executing queries using Spark SQL.
We'll cover the following
Simple query example
When we execute Spark SQL queries in the spark-shell, we don’t need to explicitly create a SparkSession
object. We would if we were writing a Spark application that was to be run using spark-submit
. The SparkSession
object is provided implicitly by the shell. Read the listing below, which is similar to what we have done in the previous lessons, except for creating the temporary view, which we discuss later.
scala> val movies = spark.read.format("csv")
.option("header", "true")
.option("samplingRatio", 0.001)
.option("inferSchema", "true")
.load("/data/BollywoodMovieDetail.csv")
movies: org.apache.spark.sql.DataFrame = [imdbId: string, title: string ... 8 more fields]
scala> movies.createOrReplaceTempView("tempView")
scala> spark.sql("SELECT title FROM tempView WHERE releaseYear > 2010 ORDER BY title desc")
.show(3)
+--------------------+
| title|
+--------------------+
| Zokkomon|
| Zindagi Tere Naam|
|Zindagi Na Milegi...|
+--------------------+
only showing top 3 rows
The query returns all the movie titles that were released post-2010 and also sorts them in descending order by title. We then print three rows from the result. The sql()
method returns the results as a DataFrame. Spark SQL offers an ANSI:2003-compliant SQL interface and the query we executed should look and feel similar to a query against a relational database table such as MySQL or PostgreSQL. We can also express the same query using DataFrames API as follows:
movies.where($"releaseYear" > 2010)
.sort(desc("title"))
Complex query example
Let’s try a slightly more complex query. Imagine that we want to list all the movies released post 2010 and at the same time label each movie below average, average, and above average if the hitFlop
column is less than 5, equal to 5, and higher than 5, respectively. The query and its output are presented below:
Get hands-on with 1400+ tech skills courses.