Getting Started
Get introduced to the course and what we’ll learn.
Overview
In this course, we’ll learn how to use PySpark instead of pandas wherever possible. In Python, pandas is a library used to manipulate and analyze data. PySpark is the Python API for Apache Spark, a large-scale data processing engine written in Scala.
We’ll use a subset of the dataset introduced by Jianmo Ni, Jiacheng Li, and Julian McAuley at Empirical Methods in Natural Language Processing (EMNLP), 2019, to demonstrate the modules of the PySpark DataFrame API. In each part, we’ll first solve some tasks using pandas. Then we’ll accomplish the same tasks in PySpark.
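As a taste of that pattern, here is a minimal sketch of a first task, reading a CSV file and peeking at a few rows, in both frameworks. The file name reviews.csv is a hypothetical placeholder, not the course dataset:

```python
# The same task in pandas and in PySpark.
# "reviews.csv" is a hypothetical placeholder file, not the course dataset.
import pandas as pd
from pyspark.sql import SparkSession

# pandas: read a CSV and look at the first rows
pdf = pd.read_csv("reviews.csv")
print(pdf.head(5))

# PySpark: the same task with the DataFrame API
spark = SparkSession.builder.appName("intro").getOrCreate()
sdf = spark.read.csv("reviews.csv", header=True, inferSchema=True)
sdf.show(5)
```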
Obtain valuable information from data
The content of the course follows a short analytics project lifecycle. In this lifecycle, we follow an almost predetermined set of actions to get valuable information out of the data, as shown in the steps below.
- Load or read the data, such as CSV, JSON, and Parquet files, in tabular form with pandas or PySpark.
- Select fields based on the project requirements. This is called subsetting.
- Explore the data a bit if it is new to us.
- Filter out or impute the invalid data.
- Introduce new calculated columns based on existing columns by aggregating the data with a framework such as pandas or PySpark, using the provided methods: group by, order by, limit, and so on.
- Calculate some metrics or produce visualizations that business partners can easily review as supporting documents when making data-driven decisions (a PySpark sketch of these steps follows this list).
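A minimal sketch of this lifecycle in PySpark might look like the following. The file name reviews.json and the column names category and rating are hypothetical, chosen only to illustrate the steps:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lifecycle").getOrCreate()

# 1. Load the data in tabular form ("reviews.json" is a placeholder)
df = spark.read.json("reviews.json")

# 2. Subset: keep only the fields the project needs
df = df.select("category", "rating")

# 3. Explore: check the schema and a few rows
df.printSchema()
df.show(5)

# 4. Filter out invalid records, e.g., missing ratings
df = df.filter(F.col("rating").isNotNull())

# 5. Aggregate: a calculated column per group, ordered and limited
metrics = (
    df.groupBy("category")
      .agg(F.avg("rating").alias("avg_rating"))
      .orderBy(F.desc("avg_rating"))
      .limit(10)
)

# 6. Produce a metric that business partners can review
metrics.show()
```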
Useful tips
Always make a snapshot of the working DataFrame whenever it makes sense. It reduces the extra overhead of querying the whole DataFrame and makes our queries much faster. Additionally, it allows us to get rid of redundant fields from the data we won’t use.
PySpark uses caching to keep a subset of the data in memory or save it locally. It then uses that subset for further tasks, which increases query speed significantly.
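A minimal sketch of this tip, assuming a DataFrame loaded from a hypothetical reviews.parquet file with more columns than we need:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching").getOrCreate()

# "reviews.parquet" and the column names are hypothetical
df = spark.read.parquet("reviews.parquet")

# Snapshot: drop redundant fields, then cache the smaller subset in memory
snapshot = df.select("category", "rating").cache()

# The first action materializes the cache; later queries reuse it
snapshot.count()
snapshot.groupBy("category").count().show()
```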