Distributed Deep Learning
Keras model training and application in PySpark.
Introduction
While MLlib provides scalable implementations of classic machine learning algorithms, it does not natively support deep learning libraries such as TensorFlow and PyTorch. There are libraries that parallelize the training of deep learning models on Spark, but the dataset must fit in memory on each worker node, so these approaches are best suited to distributed hyperparameter tuning on medium-sized datasets.
For the model application stage, where we already have a trained deep learning model and need to apply it to a large user base, we can use Pandas UDFs. With Pandas UDFs, we can:
- partition and distribute our dataset.
- run each resulting Pandas DataFrame against a Keras model.
- compile the results back into a single large Spark DataFrame (see the sketch after this list).
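To make this workflow concrete, here is a minimal sketch of scoring with a scalar Pandas UDF. The model path, the `games_df` DataFrame, and the feature column names `G1`–`G3` are assumptions for illustration, not names from the lesson:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

MODEL_PATH = "/models/games_model.h5"   # hypothetical path, visible to every worker
feature_cols = ["G1", "G2", "G3"]       # hypothetical feature column names

@pandas_udf("double")
def predict_udf(g1: pd.Series, g2: pd.Series, g3: pd.Series) -> pd.Series:
    # Importing inside the UDF ensures TensorFlow is loaded on the workers.
    import tensorflow as tf

    # Loading per batch keeps the sketch simple; in practice you might
    # broadcast the model weights instead to avoid repeated loads.
    model = tf.keras.models.load_model(MODEL_PATH)

    # Each call receives one partition's worth of data as Pandas Series.
    features = pd.concat([g1, g2, g3], axis=1).values
    preds = model.predict(features)
    return pd.Series(preds.reshape(-1).astype("float64"))

# Spark partitions the data, applies the Keras model to each partition,
# and assembles the predictions into a single Spark DataFrame column.
scored_df = games_df.withColumn("prediction", predict_udf(*feature_cols))
```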
This lesson shows how we can take the Keras model that we built in the Keras Regression lesson and scale it to larger datasets using PySpark and Pandas UDFs. However, the training data must still fit into memory on the driver node.
We’ll use the same datasets as in the prior lesson, where we split the games dataset into training and test sets of users. This is a relatively small dataset, so we can use the toPandas operation to load the DataFrames onto the driver node, as shown in the snippet below. The result is a Pandas DataFrame and a label list that we can provide as inputs to train a Keras deep learning model.
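The original snippet isn't reproduced in this excerpt, so the following is a minimal sketch of the toPandas step. The names `train`, `test`, and the `label` column are assumptions about the splits from the prior lesson:

```python
# Collect the (small) training and test splits onto the driver node.
# `train` and `test` are assumed to be the Spark DataFrame splits
# from the prior lesson; `label` is the assumed target column.
train_pd = train.toPandas()
test_pd = test.toPandas()

# Separate the label list from the feature columns for Keras.
train_labels = train_pd["label"].tolist()
train_features = train_pd.drop(columns=["label"])
```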