Avoid Global Scope
Learn about the good practices in pandas and PySpark.
DataFrames in global scope
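The following code creates several small DataFrames in the global scope. Every intermediate result (total_review_by_mth_df, total_review_2016, total_review_2017) stays bound to a global name, so code like this should be converted into a series of functions to avoid polluting the global scope: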
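from pyspark.sql import functions as fn
from pyspark.sql.functions import col

# main_df is assumed to be an existing DataFrame of reviews with the
# columns asin, review_year, and review_month.

# Count the reviews for each (year, month) pair.
total_review_by_mth_df = (
    main_df
    .groupBy('review_year', 'review_month')
    .agg(fn.count(col("asin")).alias("total_review"))
    .orderBy('review_year', 'review_month')
)

# Each filter below adds another DataFrame to the global scope.
total_review_2016 = total_review_by_mth_df.filter(col("review_year") == 2016)
total_review_2017 = total_review_by_mth_df.filter(col("review_year") == 2017)

# Put the 2016 and 2017 monthly totals side by side, joined on the month.
merged_20_16_17 = (
    total_review_2016
    .select(
        "review_month",
        col("total_review").alias("total_review_2016")
    )
    .join(
        total_review_2017
        .select("review_month", col("total_review").alias("total_review_2017")),
        on="review_month"
    )
)

merged_20_16_17.show()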
Good practice in a production environment
The aggregation and subsetting of a DataFrame can be done through a chain of function calls, where each function takes a DataFrame and returns a new one, so no intermediate DataFrame needs a global name.