Avoid Global Scope

Learn about the good practices in pandas and PySpark.

DataFrames in global scope

The following code is an example of small DataFrames created in the global scope. It should be refactored into a series of functions so that we avoid polluting the global scope:

from pyspark.sql import functions as fn
from pyspark.sql.functions import col

total_review_by_mth_df = (
    main_df
    .groupBy("review_year", "review_month")
    .agg(fn.count(col("asin")).alias("total_review"))
    .orderBy("review_year", "review_month")
)
total_review_2016 = total_review_by_mth_df.filter(col("review_year") == 2016)
total_review_2017 = total_review_by_mth_df.filter(col("review_year") == 2017)
merged_2016_17 = (
    total_review_2016
    .select(
        "review_month",
        col("total_review").alias("total_review_2016"),
    )
    .join(
        total_review_2017
        .select(
            "review_month",
            col("total_review").alias("total_review_2017"),
        ),
        on="review_month",
    )
)
merged_2016_17.show()

Good practice in a production environment

The aggregation and subsetting of a DataFrame can be done through a chain of function calls, as shown below:
