Working with DataFrames
Work through an example to learn how to read and write Spark DataFrames.
Reading and writing data in Spark is straightforward, thanks to the high-level abstractions available for connecting to a variety of external data sources, such as Kafka, RDBMSs, or NoSQL stores. Spark provides an interface, DataFrameReader, that allows us to read data into a DataFrame from various sources and in a number of formats, such as JSON, CSV, Parquet, or text.
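For example, only the format string and options change across sources; the JSON and Parquet file paths below are hypothetical:

// The same reader API handles different formats uniformly.
val jsonDF    = spark.read.format("json").load("/data/events.json")
val parquetDF = spark.read.format("parquet").load("/data/events.parquet")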
For any meaningful data analysis, we’ll be creating DataFrames from data files. When doing so, we can either have Spark infer the schema of the data or specify it ourselves. Let’s see examples of both below.
Creating DataFrames
The following snippet reads the data file BollywoodMovieDetail.csv from the location /data/BollywoodMovieDetail.csv.
val movies = spark.read.format("csv")
  .option("header", "true")        // treat the first row as column names
  .option("inferSchema", "true")   // let Spark infer each column's type
  .load("/data/BollywoodMovieDetail.csv")
The schema that Spark inferred can be examined as follows:
movies.schema
Running this command prints the inferred schema as a StructType, listing each column along with its detected data type.
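To specify the schema ourselves instead, we can build a StructType and pass it to the reader via schema(), which skips the extra pass over the data that inference requires. A minimal sketch; the column names and types below are illustrative and must be adjusted to match the header row of the actual CSV file:

import org.apache.spark.sql.types._

// A hand-written schema; the fields here are assumptions,
// adjust them to match the real file.
val movieSchema = StructType(Array(
  StructField("imdbId", StringType, nullable = true),
  StructField("title", StringType, nullable = true),
  StructField("releaseYear", IntegerType, nullable = true),
  StructField("genre", StringType, nullable = true)
))

val moviesWithSchema = spark.read.format("csv")
  .option("header", "true")
  .schema(movieSchema)   // use our schema instead of inferring one
  .load("/data/BollywoodMovieDetail.csv")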
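Writing a DataFrame out is just as convenient through the counterpart interface, DataFrameWriter. Below is a minimal sketch that saves the movies DataFrame in Parquet format; the output path /data/output/movies.parquet is an assumption:

// Write the DataFrame back out as Parquet,
// overwriting the target directory if it already exists.
movies.write
  .format("parquet")
  .mode("overwrite")
  .save("/data/output/movies.parquet")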