MLlib Batch Pipeline

Learn how to use PySpark's machine learning libraries to build predictive models.

Now that we’ve covered loading and transforming data with PySpark, we can use its machine learning libraries to build a predictive model.

MLlib

The core library for building predictive models in PySpark is called MLlib. This library provides a suite of supervised and unsupervised algorithms.

While MLlib does not cover every algorithm available in sklearn, it provides functionality for the majority of operations needed in data science workflows. In this section, we’ll show how to apply MLlib to a classification problem and save the outputs of the model application to a data lake.
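Before walking through each step, here is a rough sketch of what such an MLlib classification pipeline can look like end to end: assemble feature columns into a vector, fit a classifier, score a held-out DataFrame, and write the results to object storage. This is an illustration only; the column names, the train_df and test_df DataFrames, and the output path are placeholders rather than part of this lesson’s dataset.

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

# Assumes train_df and test_df are existing DataFrames with numeric feature
# columns and a binary "label" column (placeholder names, not the real schema)
assembler = VectorAssembler(inputCols=["feature_1", "feature_2"],
                            outputCol="features")

train_vec = assembler.transform(train_df)
model = LogisticRegression(featuresCol="features", labelCol="label").fit(train_vec)

# Score the holdout set and persist the predictions to a data lake location
predictions = model.transform(assembler.transform(test_df))
predictions.write.mode("overwrite").parquet("s3://example-bucket/predictions/")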

# Load the games dataset from S3, inferring column types from the CSV header
games_df = spark.read.csv("s3://dsp-ch6/csv/games-expand.csv",
                          header=True, inferSchema=True)
games_df.createOrReplaceTempView("games_df")

# Assign a random user_id to each row and flag roughly 30% of rows as a holdout set
games_df = spark.sql("""
  select *, row_number() over (order by rand()) as user_id
       , case when rand() > 0.7 then 1 else 0 end as test
  from games_df
""")

Loading the data

The first step in the pipeline is loading the dataset that we want to use for model training. ...