The drop() method in PySpark drops one or more columns from a DataFrame.
dataframe.drop(*cols)
cols: These are the columns to be removed.
The method returns a new DataFrame after deleting the specified columns.
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
print("Initial dataframe")
df.show(truncate=False)

cols_to_remove = ["country", "firstname"]

new_df = df.drop(*cols_to_remove)

print("-" * 8)

print("Dataframe after removing the columns")
new_df.show(truncate=False)
Line 4: A Spark session with the app name edpresso is created.
Lines 6–10: We define data for the DataFrame.
Line 12: The columns of the DataFrame are defined.
Line 13: A DataFrame is created using the createDataFrame() method.
Lines 14–15: The initial DataFrame is printed.
Line 17: The columns to be removed are defined as cols_to_remove.
Line 19: The columns are dropped by invoking the drop() method and unpacking the cols_to_remove list into its arguments with the * operator.
Line 24: The new DataFrame with the columns removed is printed.
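The * in front of cols_to_remove matters because drop(*cols) takes each column as a separate argument. The sketch below illustrates only the unpacking step in plain Python. The function drop_names is a hypothetical stand-in written for this illustration; it is not PySpark code and does not reflect how PySpark implements drop().

```python
def drop_names(all_names, *cols):
    """Hypothetical stand-in: return all_names without the names given in cols."""
    return [name for name in all_names if name not in cols]


columns = ["firstname", "lastname", "country", "state"]
cols_to_remove = ["country", "firstname"]

# With *, the list is unpacked into two separate string arguments.
unpacked = drop_names(columns, *cols_to_remove)

# This is the same call written out by hand.
explicit = drop_names(columns, "country", "firstname")

print(unpacked)  # ['lastname', 'state']
print(unpacked == explicit)  # True
```

Keeping the names in a list and unpacking them at the call site makes it easy to build the set of columns to remove at runtime.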