...

Advanced PySpark DataFrame Operations

Get an overview of advanced PySpark DataFrame operations.

We'll cover the following...

Joins
Window functions

Advanced PySpark DataFrame operations enable us to perform complicated tasks. They are broadly divided into joins and Window functions. Let’s understand these now.

Joins

Joins are used to combine two or more PySpark DataFrames based on a common column or set of columns. The common column(s) are used to match the rows from the two DataFrames, and the result is a new DataFrame that contains columns from both DataFrames. PySpark supports several types of joins, including inner join, outer join, left join, right join, and semi-join.

Here’s an example of how to perform joins between two DataFrames:

Press + to interact

Python 3.8

Files

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameActions").getOrCreate()
print(f'Import "file.csv" into PySpark DataFrame')
df1 = spark.read.csv("file.csv", header=True, inferSchema=True)
print(f'Import "file2.csv" into PySpark DataFrame')
df2 = spark.read.csv("file2.csv", header=True, inferSchema=True)
print("Join dataframe 2 with DataFrame 1 using inner")
joined_df = df1.join(df2, on="id", how="inner")
print("Showing the joined DataFrame using inner join:")
joined_df.show()
print("Join dataframe 2 with DataFrame 1 using outer")
joined_df = df1.join(df2, on="id", how="outer")
print("Showing the joined DataFrame using outer join:")
joined_df.show()

Let’s understand the code one line at a time:

Line 1: Import the SparkSession class from the pyspark.sql module.
Line 2: Create a SparkSession using the builder pattern and the getOrCreate() method.
Line 5: Use the read.csv() method of the SparkSession to read the CSV file, “file.csv,” into a DataFrame df1. The CSV file has a header row, and the schema will be inferred automatically.
Line 8: Use the read.csv() method of the SparkSession to read the CSV file, “file2.csv,” into a DataFrame df2. The CSV file has a header row, and the schema will be inferred automatically.
Line 11: Use the join() method of the DataFrame df1 to perform an inner join with df2 on the “id” column. The resulting joined DataFrame is assigned to joined_df.
Line 14: Use the show() method of the DataFrame joined_df to display the contents of the joined DataFrame.
Line 17: Reassign joined_df by performing an outer join between df1 and df2 on the “id” column.
Line 20: Use the show() method of the DataFrame joined_df to display the contents of the

...

Introduction to the Course

Introduction to Big Data

Exploring PySpark Core and RDDs

PySpark DataFrames and SQL

Customer Churn Analysis Using PySpark

Machine Learning with PySpark

Modeling with PySpark MLlib

Predicting Diabetes in Patients Using PySpark MLlib

Performance Optimization in PySpark

PySpark Optimization: Analyzing NYC Restaurants Data

Integrating PySpark with Other Big Data Tools

Wrap Up

Apriori Algorithm for Finding Frequent Itemsets with PySpark

Advanced PySpark DataFrame Operations

Joins