How to drop multiple columns from a PySpark DataFrame

Overview

The drop() method in PySpark removes one or more columns from a DataFrame and returns the result as a new DataFrame.

Syntax

dataframe.drop(*cols)

Parameters

  • cols - The names of the columns to remove, passed as separate string arguments. A list of names can be unpacked with the * operator.

Return value

The method returns a new DataFrame after deleting the specified columns.

Example

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('edpresso').getOrCreate()
data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")
        ]
columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
print("Initial dataframe")
df.show(truncate=False)
cols_to_remove = ["country", "firstname"]
new_df = df.drop(*cols_to_remove)
print("-" * 8)
print("Dataframe after removing the columns")
new_df.show(truncate=False)

Explanation

  • Line 3: A Spark session with the app name edpresso is created.

  • Lines 4–8: We define data for the DataFrame.

  • Line 9: The column names of the DataFrame are defined.

  • Line 10: A DataFrame is created using the createDataFrame() method.

  • Lines 11–12: The initial DataFrame is printed.

  • Line 13: The columns to be removed are defined as cols_to_remove.

  • Line 14: The columns are dropped by invoking the drop() method and passing the unpacked cols_to_remove list.

  • Lines 15–17: A separator is printed, followed by the new DataFrame with the columns removed.

Copyright ©2024 Educative, Inc. All rights reserved