Staging Data
Getting data from Kaggle to Spark clusters.
PySpark workflows
Data is essential for PySpark workflows. Spark supports a variety of ways to read datasets, from connecting to data lakes and data warehouses to loading sample datasets bundled with libraries, such as the Boston housing dataset. Since the theme of this course is building scalable pipelines, we’ll focus on data layers that work with distributed workflows.
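For contrast, the short sketch below shows the library route mentioned above, which works for small local experiments but not for distributed pipelines. It is only a sketch: it assumes a local SparkSession with pyspark and scikit-learn installed, and it uses the California housing data because the Boston dataset has been removed from recent scikit-learn releases.

# Sketch of the "sample dataset from a library" route (not the focus of
# this course). Assumes pyspark and scikit-learn are installed locally.
from pyspark.sql import SparkSession
from sklearn.datasets import fetch_california_housing

spark = SparkSession.builder.appName("sample-data").getOrCreate()

# Load a small in-memory dataset with scikit-learn; California housing is
# used here because the Boston dataset is no longer shipped with scikit-learn.
housing = fetch_california_housing(as_frame=True)

# Hand the pandas dataframe to Spark so it can be used in a pipeline.
housing_df = spark.createDataFrame(housing.frame)
housing_df.printSchema()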
To get started with PySpark, we’ll stage input data for a model pipeline on S3 and then read in the dataset as a Spark dataframe.
This lesson demonstrates how to stage data to S3, set up credentials for accessing the data from Spark, and fetch the data from S3 into a Spark dataframe.
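As a preview of where this lesson ends up, here is a minimal sketch of reading a staged file from S3 into a Spark dataframe. The bucket name matches the one created in the next step, but the object key (csv/games.csv) and the environment-variable credentials are assumptions for illustration; a managed cluster with an IAM instance role would not need the credential settings, and the s3a connector (hadoop-aws) must be available on the cluster.

import os
from pyspark.sql import SparkSession

# Build a SparkSession that can read from S3 via the s3a:// connector.
# Credentials are pulled from environment variables here; on a managed
# cluster an IAM instance role usually makes these two settings unnecessary.
spark = (
    SparkSession.builder
    .appName("staging-data")
    .config("spark.hadoop.fs.s3a.access.key", os.environ["AWS_ACCESS_KEY_ID"])
    .config("spark.hadoop.fs.s3a.secret.key", os.environ["AWS_SECRET_ACCESS_KEY"])
    .getOrCreate()
)

# Read the staged CSV into a Spark dataframe; the key is a placeholder
# for wherever the Kaggle file is uploaded.
games_df = spark.read.csv("s3a://dsp-ch6/csv/games.csv", header=True, inferSchema=True)
games_df.show(5)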
Setting up an S3 bucket
The first step is to set up an S3 bucket to store the dataset we want to load. To perform this step, run the following commands from the command line.
aws s3api create-bucket --bucket dsp-ch6 --region us-east-1
aws s3 ls
...
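With the bucket in place, the downloaded Kaggle file can be copied into it from the command line. This is a sketch: the local filename and the csv/ prefix are placeholders for however the download is named and organized.

# Upload the downloaded Kaggle CSV to the new bucket (paths are placeholders)
aws s3 cp games.csv s3://dsp-ch6/csv/games.csv

# Confirm the object landed where we expect
aws s3 ls s3://dsp-ch6/csv/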