Advanced PySpark DataFrame Operations
Get an overview of advanced PySpark DataFrame operations.
We'll cover the following...
We'll cover the following...
Advanced PySpark DataFrame operations enable us to perform complicated tasks. They are broadly divided into joins and Window functions. Let’s understand these now.
Joins
Joins are used to combine two or more PySpark DataFrames based on a common column or set of columns. The common column(s) are used to match the rows from the two DataFrames, and the result is a new DataFrame that contains columns from both DataFrames. PySpark supports several types of joins, including inner join, outer join, left join, right join, and semi-join.
Here’s an example of how to perform joins between two DataFrames:
Python 3.8
Files
from pyspark.sql import SparkSessionspark = SparkSession.builder.appName("DataFrameActions").getOrCreate()print(f'Import "file.csv" into PySpark DataFrame')df1 = spark.read.csv("file.csv", header=True, inferSchema=True)print(f'Import "file2.csv" into PySpark DataFrame')df2 = spark.read.csv("file2.csv", header=True, inferSchema=True)print("Join dataframe 2 with DataFrame 1 using inner")joined_df = df1.join(df2, on="id", how="inner")print("Showing the joined DataFrame using inner join:")joined_df.show()print("Join dataframe 2 with DataFrame 1 using outer")joined_df = df1.join(df2, on="id", how="outer")print("Showing the joined DataFrame using outer join:")joined_df.show()
Let’s understand the code one line at a time:
- Line 1: Import the
SparkSessionclass from thepyspark.sqlmodule. - Line 2: Create a
SparkSessionusing thebuilderpattern and thegetOrCreate()method. - Line 5: Use the
read.csv()method of theSparkSessionto read the CSV file, “file.csv,” into a DataFramedf1. The CSV file has a header row, and the schema will be inferred automatically. - Line 8: Use the
read.csv()method of theSparkSessionto read the CSV file, “file2.csv,” into a DataFramedf2. The CSV file has a header row, and the schema will be inferred automatically. - Line 11: Use the
join()method of the DataFramedf1to perform an inner join withdf2on the “id” column. The resulting joined DataFrame is assigned tojoined_df. - Line 14: Use the
show()method of the DataFramejoined_dfto display the contents of the joined DataFrame. - Line 17: Reassign
joined_dfby performing an outer join betweendf1anddf2on the “id” column. - Line 20: Use the
show()method of the DataFramejoined_dfto display the contents of the