Introduction to Data Input and Output
Learn about the flow of data input and output.
Overview
One of the most common tasks in a data analysis project is reading data and writing it back in a new form, either as a final output or as an updated version of the raw data.
Data input and output flow
The flow of data input and output is as follows:
- Read the data into a pandas or PySpark DataFrame.
- Rename the columns of the DataFrame using the `withColumnRenamed` method.
- Select only the subset of columns relevant to our analysis.
- Write the data back to disk as a distributed dataset across multiple files, using partitioning and sorting for better performance, as sketched in the example below.
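To make this flow concrete, here's a minimal sketch of the four steps. The file paths and column names (for example, `evt_ts` and `country`) are hypothetical placeholders rather than the course dataset, and the sketch assumes a Spark session like the one we create later in this lesson.

```python
from pyspark.sql import SparkSession

# Assumption: a local session similar to the one created later in this lesson.
spark = SparkSession.builder.appName("IOFlowSketch").master("local[5]").getOrCreate()

# 1. Read the raw data into a PySpark DataFrame (hypothetical path).
df = spark.read.json("data/raw/events.json")

# 2. Rename a column to a more readable name (hypothetical column names).
df = df.withColumnRenamed("evt_ts", "event_timestamp")

# 3. Keep only the columns relevant to the analysis.
df = df.select("event_timestamp", "user_id", "country")

# 4. Write back to disk as a distributed dataset, partitioned by a column
#    and sorted within each partition for better read performance.
(
    df.sortWithinPartitions("event_timestamp")
    .write
    .mode("overwrite")
    .partitionBy("country")
    .parquet("data/processed/events")
)
```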
How we read a dataset depends on the data source provided, and we might need some preprocessing before we read the data into a PySpark DataFrame. The input might arrive in CSV, JSON, or Parquet format. The dataset we're using in this course is in JSON format, so we'll focus on reading JSON data.
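As a quick illustration, each of these formats has a dedicated reader on `spark.read`; the paths below are placeholders rather than the actual course files.

```python
# Hypothetical input paths; each format has its own reader method.
csv_df = spark.read.option("header", True).csv("data/raw/events.csv")
json_df = spark.read.json("data/raw/events.json")            # the format used in this course
parquet_df = spark.read.parquet("data/raw/events.parquet")
```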
We can also read data from databases into a PySpark DataFrame using database-specific JDBC or ODBC drivers.
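As a rough sketch of a JDBC read, the snippet below uses a placeholder PostgreSQL URL, table, credentials, and driver class; it also assumes the matching JDBC driver JAR is available to Spark.

```python
# All connection details here are hypothetical placeholders.
jdbc_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/analytics")
    .option("dbtable", "public.events")
    .option("user", "reader")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)
```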
To run PySpark, we first need to initialize its session. Let’s take a quick look at how to do that.
Create a PySpark session
We'll write code to initialize the environment.
Code for creating a PySpark session
In the code below, we use the `create_spark_session` function to create a Spark executor with four follower nodes and one leader node. These use five threads to accomplish any PySpark task.
```python
from pyspark.sql import SparkSession
from dotenv import load_dotenv
def create_spark_session():
    """Create a Spark Session"""
    _ = load_dotenv()
    return (
        SparkSession
        .builder
        .appName("SparkApp")
        .master("local[5]")
        .getOrCreate()
    )
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')
```
Explanation
- Lines 1–2: We import the required libraries. We use `SparkSession` to create a PySpark session and `load_dotenv` to load environment variables.
- Lines 3–12: We define a function, `create_spark_session`, to create a PySpark session.
  - Line 3: We define the function.
  - Line 5: We load the environment.
  - Lines 6–12: We return a PySpark session.
    - Line 9: We assign the name of the session, `"SparkApp"`.
    - Line 10: We run the session locally, using five threads as logical cores on our machine.
- Line 13: We call the function to create a PySpark session.
- Line 14: We print that our session has started.
After a successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
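If we want to double-check the session configuration, we can also print the Spark version and the master setting; this is an optional addition, not part of the course code.

```python
print(spark.version)              # the Spark version backing the session
print(spark.sparkContext.master)  # should print local[5]
```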