Advanced PySpark DataFrame Operations
Get an overview of advanced PySpark DataFrame operations.
Advanced PySpark DataFrame operations let us perform more complex tasks, such as combining datasets and computing values across related rows. They fall broadly into two groups: joins and window functions. Let’s look at each in turn.
Joins
Joins are used to combine two or more PySpark DataFrames based on a common column or set of columns. The common column(s) are used to match the rows from the two DataFrames, and the result is a new DataFrame that contains columns from both DataFrames. PySpark supports several types of joins, including inner join, outer join, left join, right join, and semi-join.
Here’s an example of how to perform joins between two DataFrames:
main.py
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameActions").getOrCreate()

print('Import "file.csv" into PySpark DataFrame')
df1 = spark.read.csv("file.csv", header=True, inferSchema=True)

print('Import "file2.csv" into PySpark DataFrame')
df2 = spark.read.csv("file2.csv", header=True, inferSchema=True)

print("Join dataframe 2 with DataFrame 1 using inner")
joined_df = df1.join(df2, on="id", how="inner")

print("Showing the joined DataFrame using inner join:")
joined_df.show()

print("Join dataframe 2 with DataFrame 1 using outer")
joined_df = df1.join(df2, on="id", how="outer")

print("Showing the joined DataFrame using outer join:")
joined_df.show()
Let’s understand the code one line at a time:
- Line 1: Import the SparkSession class from the pyspark.sql module.
- Line 2: Create a SparkSession using the builder pattern and the getOrCreate() method.
- Line 5: Use the read.csv() method of the SparkSession to read the CSV file, “file.csv,” into a DataFrame df1. The CSV file has a header row, and the schema will be inferred automatically.
- Line 8: Use the read.csv() method of the SparkSession to read the CSV file, “file2.csv,” into a DataFrame df2. The CSV file has a header row, and the schema will be inferred automatically.
- Line 11: Use the join() method of the DataFrame df1 to perform an inner join with df2 on the “id” column. The resulting joined DataFrame is assigned to joined_df.
- Line 14: Use the show() method of the DataFrame joined_df to display the contents of the joined DataFrame.
- Line 17: Reassign joined_df by performing an outer join between df1 and df2 on the “id” column.
- Line 20: Use the show() method of the DataFrame joined_df to