Duplicate columns in a DataFrame increase its memory consumption and carry redundant data. Duplicate columns can be dropped from a Spark DataFrame with the following steps:
Find the list of duplicate columns: two columns are duplicates if they contain the same data.
Use the drop() method to drop one or more columns of a Spark DataFrame.
Alternatively, instead of dropping the duplicate columns, we can select only the non-duplicate columns.
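The first step, finding duplicate columns, is not built into PySpark. A minimal sketch of the detection logic is shown below on plain Python rows so it runs without a Spark cluster; with PySpark, the same comparison can be applied to the rows returned by df.collect() on a small DataFrame. The function name and sample data are illustrative, not part of the PySpark API.

```python
# A minimal sketch of duplicate-column detection, shown on plain Python rows.
# With PySpark, the same idea applies to df.collect() for a small DataFrame.
def find_duplicate_columns(rows, columns):
    """Return names of columns whose data duplicates an earlier column."""
    # Gather each column's values by position.
    col_data = {c: [row[i] for row in rows] for i, c in enumerate(columns)}
    duplicates = []
    for i, left in enumerate(columns):
        for right in columns[i + 1:]:
            if right not in duplicates and col_data[left] == col_data[right]:
                duplicates.append(right)  # keep the first column, flag the later one
    return duplicates

rows = [("James", "James", "Smith", "USA", "CA", "USA"),
        ("Michael", "Michael", "Rose", "Russia", "Novogrod", "Russia")]
columns = ["firstname", "firstname_dup", "lastname",
           "country", "state", "country_duplicate"]
print(find_duplicate_columns(rows, columns))
# → ['firstname_dup', 'country_duplicate']
```

Collecting every column to the driver is only practical for small DataFrames; for large data, comparing columns with Spark-side aggregations would be preferable.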
Note: To learn more about dropping columns, refer to how to drop multiple columns from a PySpark DataFrame.
Let’s look at the code below:
```python
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James", "James", "Smith", "USA", "CA", "USA"),
        ("Michael", "Michael", "Rose", "Russia", "Novogrod", "Russia"),
        ("Robert", "Robert", "Williams", "Canada", "Ontario", "Canada"),
        ("Maria", "Maria", "Jones", "Australia", "Perth", "Australia")]
columns = ["firstname", "firstname_dup", "lastname",
           "country", "state", "country_duplicate"]
df = spark.createDataFrame(data=data, schema=columns)

# Drop the columns that duplicate firstname and country
dup_cols = ["country_duplicate", "firstname_dup"]
new_df = df.drop(*dup_cols)

print("-" * 8)
print("DataFrame after removing the duplicate columns")
new_df.show(truncate=False)
```
In the code above, pyspark and SparkSession are imported, and a DataFrame is created with two duplicate columns. The drop() method, with the list of duplicate column names unpacked as its arguments, returns a new DataFrame without those columns.