Introduction to Data Input and Output
Learn about the flow of data input and output.
Overview
One of the most common tasks in a data analysis project is reading data and writing it back in a new form, either as a final output or as an updated version of the raw data.
Data input and output flow
The flow of data input and output is as follows:
- Read the data into a pandas or PySpark DataFrame.
- Rename the columns of the DataFrame using the `withColumnRenamed` method.
- Select only the subset of columns relevant to our analysis.
- Write the data back to disk as a distributed dataset across multiple files, using partitioning and sorting for better performance, as sketched in the example below.
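To make this flow concrete, here's a minimal sketch of the four steps. The file paths and column names (for example, `evt_ts` and `country`) are hypothetical placeholders rather than the course dataset, and the sketch assumes a Spark session like the one we create later in this lesson.

```python
from pyspark.sql import SparkSession

# Assumption: a local session similar to the one created later in this lesson.
spark = SparkSession.builder.appName("IOFlowSketch").master("local[5]").getOrCreate()

# 1. Read the raw data into a PySpark DataFrame (hypothetical path).
df = spark.read.json("data/raw/events.json")

# 2. Rename a column to a more readable name (hypothetical column names).
df = df.withColumnRenamed("evt_ts", "event_timestamp")

# 3. Keep only the columns relevant to the analysis.
df = df.select("event_timestamp", "user_id", "country")

# 4. Write back to disk as a distributed dataset, partitioned by a column
#    and sorted within each partition for better read performance.
(
    df.sortWithinPartitions("event_timestamp")
    .write
    .mode("overwrite")
    .partitionBy("country")
    .parquet("data/processed/events")
)
```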
How we read a dataset depends on the data source provided, and we might need some preprocessing before we read the data into a PySpark DataFrame. The input might arrive in CSV, JSON, or Parquet format. The dataset we're using in this course is in JSON format, so we'll focus on reading JSON data.
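As a quick illustration, each of these formats has a dedicated reader on `spark.read`; the paths below are placeholders rather than the actual course files.

```python
# Hypothetical input paths; each format has its own reader method.
csv_df = spark.read.option("header", True).csv("data/raw/events.csv")
json_df = spark.read.json("data/raw/events.json")            # the format used in this course
parquet_df = spark.read.parquet("data/raw/events.parquet")
```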
We can also read data from databases into a PySpark DataFrame using database-specific JDBC or ODBC drivers.
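As a rough sketch of a JDBC read, the snippet below uses a placeholder PostgreSQL URL, table, credentials, and driver class; it also assumes the matching JDBC driver JAR is available to Spark.

```python
# All connection details here are hypothetical placeholders.
jdbc_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/analytics")
    .option("dbtable", "public.events")
    .option("user", "reader")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)
```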
To run PySpark, we first need to initialize its session. Let’s take a quick look at how to do that.
Create a PySpark session
We'll write code to initialize the environment.
Code for creating a PySpark session
In the code below, we use the `create_spark_session` function to create a Spark executor with four follower nodes and one leader node. These use five threads to accomplish any PySpark task.
```python
from pyspark.sql import SparkSession
from dotenv import load_dotenv
def create_spark_session():
    """Create a Spark Session"""
    _ = load_dotenv()
    return (
        SparkSession
        .builder
        .appName("SparkApp")
        .master("local[5]")
        .getOrCreate()
    )
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')
```
Explanation
- Lines 1–2: We import the required libraries. We use `SparkSession` to create a PySpark session and `load_dotenv` to load environment variables.
- Lines 3–12: We define a function, `create_spark_session`, to create a PySpark session.
  - Line 3: We define the function.
  - Line 5: We load the environment.
  - Lines 6–12: We return a PySpark session.
    - Line 9: We assign the name of the session, `"SparkApp"`.
    - Line 10: We run the session locally, using five threads as logical cores on our machine.
- Line 13: We call the function to create a PySpark session.
- Line 14: We print that our session has started.
After a successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
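If we want to double-check the session configuration, we can also print the Spark version and the master setting; this is an optional addition, not part of the course code.

```python
print(spark.version)              # the Spark version backing the session
print(spark.sparkContext.master)  # should print local[5]
```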