- Transforming Data
Manipulating and visualizing data in PySpark.
The PySpark Dataframe API provides a variety of useful functions for aggregating, filtering, pivoting, and summarizing data. While some of this functionality maps well to Pandas operations, my recommendation for getting up and running quickly with data munging in PySpark is to use Spark SQL, the SQL interface to dataframes in Spark. If you’re already using the pandasql or framequery libraries, then Spark SQL should provide a familiar interface.
If you’re new to these libraries, then the SQL interface still provides an approachable way of working with the Spark ecosystem. We’ll cover the Dataframe API later, but first we’ll start with the SQL interface to get up and running.
Exploratory data analysis (EDA)
Exploratory data analysis (EDA) is one of the key steps in a data science workflow for understanding the shape of a dataset. To work through this process in PySpark, we’ll load the stats dataset into a dataframe, expose it as a view, and calculate the summary statistics. The snippet below shows how to load the NHL stats dataset, expose it as a view to Spark, and run a query against the dataframe. The aggregated dataframe is then visualized using the display command in Databricks.
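As a minimal sketch of the load, view, and query steps described above, the function below reads a CSV file into a dataframe, registers it as a temporary view named stats, and runs an aggregate query through Spark SQL. The column names (player_id, goals), the query itself, and the file path passed in are assumptions made for illustration; the real dataset may use a different schema, so adjust them to match your file.

```python
# Sketch only: the column names (player_id, goals) are hypothetical and
# may not match the schema of the actual NHL stats file.
SUMMARY_QUERY = """
    SELECT player_id,
           COUNT(*)   AS games,
           SUM(goals) AS total_goals
    FROM stats
    GROUP BY player_id
    ORDER BY total_goals DESC
    LIMIT 5
"""


def load_and_summarize(spark, csv_path):
    """Load a CSV, expose it as the `stats` view, and return the summary.

    `spark` is an existing SparkSession (predefined in Databricks notebooks).
    `csv_path` is wherever you stored the file; no default path is assumed.
    """
    # Read the file, using the first row as headers and inferring column types.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)

    # Register the dataframe under a name that SQL queries can reference.
    df.createOrReplaceTempView("stats")

    # Run the aggregate query; the result is itself a Spark dataframe.
    return spark.sql(SUMMARY_QUERY)
```

In a Databricks notebook, where spark is already defined, the result can be passed to display, as in display(load_and_summarize(spark, path)), with path set to your own file location. Outside Databricks, calling .show() on the returned dataframe prints the rows instead.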