How to drop duplicate columns in PySpark

Duplicate columns in a DataFrame waste memory and carry redundant data. They can be dropped from a Spark DataFrame in two steps:

  1. Determine which columns are duplicate
  2. Drop the columns that are duplicate

Determining duplicate columns

Two columns are duplicates if they contain the same data in every row. The first step is to identify the list of such columns.

Dropping duplicate columns

The drop() method removes one or more columns from a Spark DataFrame.

Instead of dropping the columns, we can select the non-duplicate columns.

Note: To learn more about dropping columns, refer to how to drop multiple columns from a PySpark DataFrame.

Code example

Let’s look at the code below:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [
    ("James", "James", "Smith", "USA", "CA", "USA"),
    ("Michael", "Michael", "Rose", "Russia", "Novogrod", "Russia"),
    ("Robert", "Robert", "Williams", "Canada", "Ontario", "Canada"),
    ("Maria", "Maria", "Jones", "Australia", "Perth", "Australia"),
]
columns = ["firstname", "firstname_dup", "lastname", "country", "state", "country_duplicate"]
df = spark.createDataFrame(data=data, schema=columns)

dup_cols = ["country_duplicate", "firstname_dup"]

new_df = df.drop(*dup_cols)

print("-" * 8)
print("Dataframe after removing the duplicate columns")
new_df.show(truncate=False)

Code explanation

  • Lines 1-2: The pyspark module and SparkSession are imported.
  • Line 4: A Spark session is created.
  • Lines 6-13: A DataFrame with duplicate columns is created.
  • Line 15: The list of duplicate columns is defined.
  • Line 17: A new DataFrame without the duplicate columns is obtained by dropping them.

Copyright ©2024 Educative, Inc. All rights reserved