Getting Started
Get introduced to the course and what we’ll learn.
Overview
In this course, we’ll learn how to use PySpark instead of pandas wherever possible. In Python, pandas is a library used to manipulate and analyze data. PySpark is the Python API for Apache Spark, a large-scale data processing engine written in Scala.
We’ll use a subset of the dataset introduced by Jianmo Ni, Jiacheng Li, and Julian McAuley at Empirical Methods in Natural Language Processing (EMNLP), 2019, to demonstrate the modules of the PySpark DataFrame API. In each part, we’ll first solve some tasks using pandas. Then we’ll accomplish the same tasks in PySpark.
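As a taste of that pattern, here is a minimal sketch of a first task, reading a CSV file and peeking at a few rows, in both frameworks. The file name reviews.csv is a hypothetical placeholder, not the course dataset:

```python
# The same task in pandas and in PySpark.
# "reviews.csv" is a hypothetical placeholder file, not the course dataset.
import pandas as pd
from pyspark.sql import SparkSession

# pandas: read a CSV and look at the first rows
pdf = pd.read_csv("reviews.csv")
print(pdf.head(5))

# PySpark: the same task with the DataFrame API
spark = SparkSession.builder.appName("intro").getOrCreate()
sdf = spark.read.csv("reviews.csv", header=True, inferSchema=True)
sdf.show(5)
```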
Obtain valuable information from data
The content of the course follows a short analytics project lifecycle. In this lifecycle, we follow an almost predetermined set of actions to get valuable information out of the data, as shown in the steps below.
- Load or read the data, such as CSV, JSON, and Parquet files, in tabular form with pandas or PySpark.
- Select fields based on the project requirements. This is called subsetting.
- Explore the data a bit if it is new to us.
- Filter out or impute the invalid data.
- Introduce new calculated columns based on existing columns by aggregating the data with a framework such as pandas or PySpark, using the provided methods: group by, order by, limit, and so on.
- Calculate some metrics or produce visualizations that business partners can easily review as supporting documents when making data-driven decisions (a PySpark sketch of these steps follows this list).
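A minimal sketch of this lifecycle in PySpark might look like the following. The file name reviews.json and the column names category and rating are hypothetical, chosen only to illustrate the steps:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lifecycle").getOrCreate()

# 1. Load the data in tabular form ("reviews.json" is a placeholder)
df = spark.read.json("reviews.json")

# 2. Subset: keep only the fields the project needs
df = df.select("category", "rating")

# 3. Explore: check the schema and a few rows
df.printSchema()
df.show(5)

# 4. Filter out invalid records, e.g., missing ratings
df = df.filter(F.col("rating").isNotNull())

# 5. Aggregate: a calculated column per group, ordered and limited
metrics = (
    df.groupBy("category")
      .agg(F.avg("rating").alias("avg_rating"))
      .orderBy(F.desc("avg_rating"))
      .limit(10)
)

# 6. Produce a metric that business partners can review
metrics.show()
```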
Useful tips
Always make a snapshot of the working DataFrame whenever it makes sense. It reduces the extra overhead of querying the whole DataFrame and makes our queries much faster. Additionally, it allows us to get rid of redundant fields from the data we won’t use.
PySpark uses caching to keep a subset of the data in memory or save it locally. It then uses that subset for further tasks, which increases query speed significantly.
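A minimal sketch of this tip, assuming a DataFrame loaded from a hypothetical reviews.parquet file with more columns than we need:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("caching").getOrCreate()

# "reviews.parquet" and the column names are hypothetical
df = spark.read.parquet("reviews.parquet")

# Snapshot: drop redundant fields, then cache the smaller subset in memory
snapshot = df.select("category", "rating").cache()

# The first action materializes the cache; later queries reuse it
snapshot.count()
snapshot.groupBy("category").count().show()
```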