Solution: Data Input and Output
Explore how to efficiently handle data input and output operations in PySpark, including reading datasets, renaming and selecting columns, repartitioning with bucketing and sorting, and writing data in Parquet format. This lesson helps you gain practical skills for managing large distributed datasets.
Task
Save the dataset as a distributed dataset with proper bucketing and sorting.
Solution
from dotenv import load_dotenv
from pyspark.sql import SparkSession


def create_spark_session():
    """Create a Spark session that runs locally on five cores"""
    _ = load_dotenv()
    spark = SparkSession.builder.appName("SparkApp").master("local[5]").getOrCreate()
    return spark


def read_sdf(spark, PATH_BIGDATA):
    """Read the raw JSON dataset into a Spark DataFrame"""
    raw_sdf = spark.read.json(PATH_BIGDATA)
    return raw_sdf


def rename_columns(df, column_map):
    """Rename columns according to the {old_name: new_name} mapping"""
    for old, new in column_map.items():
        df = df.withColumnRenamed(old, new)
    return df


def select_subset(df, columns):
    """Select only the required columns"""
    df = df.select(*columns)
    return df


def repartitioning_and_saving(df, PATH_SNAPSHOT):
    """Repartition the DataFrame and save the snapshot in Parquet format"""
    # Hash-partition by review year and month, then sort each partition by product ID
    df = df.repartition("reviewed_year", "reviewed_month").sortWithinPartitions("asin")
    df.write.mode("overwrite").parquet(PATH_SNAPSHOT)
Solution to the data input and output challenge
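As a usage sketch, here is one way these functions could be chained end to end. This is a minimal driver under stated assumptions: it assumes the .env file defines PATH_BIGDATA and PATH_SNAPSHOT, and the column names in column_map and columns are illustrative placeholders, not the challenge's actual schema.

import os

# Minimal driver sketch; assumes .env defines PATH_BIGDATA and PATH_SNAPSHOT
spark = create_spark_session()
raw_sdf = read_sdf(spark, os.getenv("PATH_BIGDATA"))

# Illustrative names only; the real source columns depend on the dataset schema
column_map = {"reviewYear": "reviewed_year", "reviewMonth": "reviewed_month"}
columns = ["asin", "overall", "reviewed_year", "reviewed_month"]

sdf = rename_columns(raw_sdf, column_map)
sdf = select_subset(sdf, columns)
repartitioning_and_saving(sdf, os.getenv("PATH_SNAPSHOT"))

A note on the design: repartition("reviewed_year", "reviewed_month") hash-partitions the rows so that records sharing a year and month land in the same partition, and sortWithinPartitions("asin") then orders each partition locally, avoiding the full shuffle a global sort would require.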
Explanation
- The load_dotenv() call reads the key-value pairs from the
.env...
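For reference, a .env file for this setup might look like the sketch below; the variable names match the function parameters above, but the paths themselves are hypothetical.

# Hypothetical .env contents; replace the paths with your own
PATH_BIGDATA=/data/reviews/raw_reviews.json
PATH_SNAPSHOT=/data/snapshots/reviews_parquet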