How to drop duplicate columns in PySpark

Duplicate columns in a DataFrame waste memory and carry redundant data. They can be dropped from a Spark DataFrame in two steps:

  1. Determine which columns are duplicate
  2. Drop the columns that are duplicate

Determining duplicate columns

Two columns are duplicates if they contain the same data in every row. The first step is to identify the list of such columns.

Dropping duplicate columns

The drop() method removes one or more columns from a Spark DataFrame.

Instead of dropping the columns, we can select the non-duplicate columns.

Note: To learn more about dropping columns, refer to how to drop multiple columns from a PySpark DataFrame.

Code example

Let’s look at the code below:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [
    ("James", "James", "Smith", "USA", "CA", "USA"),
    ("Michael", "Michael", "Rose", "Russia", "Novogrod", "Russia"),
    ("Robert", "Robert", "Williams", "Canada", "Ontario", "Canada"),
    ("Maria", "Maria", "Jones", "Australia", "Perth", "Australia"),
]
columns = ["firstname", "firstname_dup", "lastname", "country", "state", "country_duplicate"]
df = spark.createDataFrame(data=data, schema=columns)

dup_cols = ["country_duplicate", "firstname_dup"]

new_df = df.drop(*dup_cols)

print("-" * 8)
print("Dataframe after removing the duplicate columns")
new_df.show(truncate=False)

Code explanation

  • Lines 1-2: The pyspark module and SparkSession are imported.
  • Line 4: A Spark session is created.
  • Lines 6-13: A DataFrame with duplicate columns is created.
  • Line 15: The list of duplicate columns is defined.
  • Line 17: A new DataFrame without the duplicate columns is obtained by dropping them.

Copyright ©2024 Educative, Inc. All rights reserved