Distributed Deep Learning
Keras model training and application in PySpark.
Introduction
While MLlib provides scalable implementations of classic machine learning algorithms, it does not natively support deep learning libraries such as TensorFlow and PyTorch. Libraries that parallelize the training of deep learning models on Spark do exist, but they require the dataset to fit in memory on each worker node, so these approaches are best suited to distributed hyperparameter tuning on medium-sized datasets.
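To make that tuning pattern concrete, here is a minimal sketch, assuming the full (toy) training set fits in worker memory: the data is broadcast once, and Spark evaluates one small Keras training run per candidate learning rate in parallel. The dataset, model architecture, and learning-rate grid below are hypothetical placeholders.

```python
# A minimal sketch of distributed hyperparameter tuning on Spark, assuming
# the full training set fits in memory on each worker node.
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hyperparam-tuning").getOrCreate()
sc = spark.sparkContext

# Toy dataset; in practice this would be your in-memory training set.
X = np.random.rand(1000, 10).astype("float32")
y = np.random.rand(1000).astype("float32")

# Broadcast the data once so every worker trains against the same copy.
data_bc = sc.broadcast((X, y))

def train_and_score(lr):
    """Train a small Keras model with the given learning rate; return its final loss."""
    import tensorflow as tf  # imported on the worker
    X, y = data_bc.value
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mse")
    history = model.fit(X, y, epochs=5, batch_size=64, verbose=0)
    return (lr, history.history["loss"][-1])

# One independent training run per candidate learning rate, in parallel.
learning_rates = [1e-4, 1e-3, 1e-2, 1e-1]
results = sc.parallelize(learning_rates, len(learning_rates)).map(train_and_score).collect()
best_lr, best_loss = min(results, key=lambda r: r[1])
print(f"best lr={best_lr}, loss={best_loss:.4f}")
```

Because each task trains an independent model, this pattern scales with the size of the search grid rather than the size of the dataset, which is why it only pays off when every worker can hold the data.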
For the model application stage, where we already have a trained deep learning model but need to apply it to a large user base, we can use Pandas UDFs. With Pandas UDFs, we can (as sketched after this list):
- partition and distribute our dataset.
- run the resulting pandas DataFrames against a Keras model.
- compile the results back into a single large Spark DataFrame.
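Here is a minimal sketch of that workflow using a scalar Pandas UDF (Spark 3+). The model path, the feature columns f0 and f1, and the toy DataFrame are hypothetical placeholders; the stand-in model below takes the place of the trained regression model.

```python
# A minimal sketch of distributed inference with a scalar Pandas UDF.
import pandas as pd
import tensorflow as tf
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.appName("keras-inference").getOrCreate()

# Stand-in for the trained model; on a real cluster, save it to shared
# storage so every worker can read it.
MODEL_PATH = "/tmp/keras_regression.h5"  # hypothetical path
stand_in = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(2,)),
    tf.keras.layers.Dense(1),
])
stand_in.compile(optimizer="adam", loss="mse")
stand_in.save(MODEL_PATH)

# Toy Spark DataFrame standing in for the large user base to score.
df = spark.range(0, 100_000).selectExpr(
    "cast(id % 7 as double) as f0",
    "cast(id % 13 as double) as f1",
)

@pandas_udf(DoubleType())
def predict_udf(f0: pd.Series, f1: pd.Series) -> pd.Series:
    # Each invocation receives one pandas batch from a Spark partition.
    # Reloading the model here keeps the sketch simple; see the note below.
    model = tf.keras.models.load_model(MODEL_PATH)
    features = pd.concat([f0, f1], axis=1).to_numpy()
    return pd.Series(model.predict(features, verbose=0).ravel())

# The per-batch predictions are compiled back into one Spark DataFrame.
scored = df.withColumn("prediction", predict_udf("f0", "f1"))
scored.show(5)
```

Reloading the model on every batch is wasteful for large models; Spark 3's Iterator-of-Series Pandas UDFs let you load the model once per partition and stream batches through it instead.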
This lesson shows how we can take the Keras model that we built in the Keras Regression lesson and scale it to larger datasets ...