Solution: Data Input and Output

Explore how to efficiently handle data input and output operations in PySpark, including reading datasets, renaming and selecting columns, repartitioning, bucketing, and sorting data, and writing it out in Parquet format. This lesson helps you build practical skills for managing large distributed datasets.

Task

Save the dataset as a distributed dataset with proper bucketing and sorting.

Solution

from dotenv import load_dotenv
from pyspark.sql import SparkSession


def create_spark_session():
    """Create a Spark session."""
    _ = load_dotenv()  # read key-value pairs from the .env file into the environment
    spark = SparkSession.builder.appName("SparkApp").master("local[5]").getOrCreate()
    return spark


def read_sdf(spark, PATH_BIGDATA):
    """Read the raw JSON dataset into a Spark DataFrame."""
    raw_sdf = spark.read.json(PATH_BIGDATA)
    return raw_sdf


def rename_columns(df, column_map):
    """Rename columns according to an {old_name: new_name} mapping."""
    for old, new in column_map.items():
        df = df.withColumnRenamed(old, new)
    return df


def select_subset(df, columns):
    """Select only the required columns."""
    df = df.select(*columns)
    return df


def repartitioning_and_saving(df, PATH_SNAPSHOT):
    """Repartition by year and month, sort within each partition, and save the snapshot."""
    df = df.repartition("reviewed_year", "reviewed_month").sortWithinPartitions("asin")
    df.write.mode("overwrite").parquet(PATH_SNAPSHOT)

Solution to the data input and output challenge
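
To see how these helpers fit together, here is a minimal end-to-end sketch. The paths and the column names in the mapping are illustrative assumptions, not values given by the lesson:

# Hypothetical driver: PATH_BIGDATA, PATH_SNAPSHOT, and the column
# names below are illustrative assumptions.
PATH_BIGDATA = "data/reviews.json"
PATH_SNAPSHOT = "data/snapshot"

spark = create_spark_session()
raw_sdf = read_sdf(spark, PATH_BIGDATA)
renamed_sdf = rename_columns(raw_sdf, {"overall": "rating"})
subset_sdf = select_subset(renamed_sdf, ["asin", "rating", "reviewed_year", "reviewed_month"])
repartitioning_and_saving(subset_sdf, PATH_SNAPSHOT)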

Explanation

  • Line 7: We read the key-value pairs from the .env ...
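
Note that repartition(...) only controls how rows are distributed across in-memory partitions before the write; it does not by itself produce a partitioned directory layout or Hive-style buckets on disk. If write-time partitioning or true bucketing were wanted, the standard Spark writer APIs would look like the sketch below. This is an assumption for illustration, not part of the graded solution, and the table name is hypothetical; also note that bucketBy() must be combined with saveAsTable() rather than a plain parquet() call:

# Sketch only: write-time partitioning and bucketing (not the lesson's solution).
# Partitioned directory layout on disk:
df.write.mode("overwrite").partitionBy("reviewed_year", "reviewed_month").parquet(PATH_SNAPSHOT)

# Hive-style bucketing; bucketBy() requires saveAsTable():
df.write.mode("overwrite").bucketBy(8, "asin").sortBy("asin").saveAsTable("reviews_snapshot")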