Persisting DataFrames
Using persistent storage with PySpark.
Persistent storage
Saving a dataframe to persistent storage and reading a dataset back in from a storage layer are common operations in PySpark. While PySpark can work with databases such as Redshift, it performs much better with distributed file stores such as S3 or GCS. In this chapter, we'll use these types of storage layers as the outputs of model pipelines, but it's also useful to stage data to S3 as an intermediate step within a workflow.
For example, in the AutoModel system at Zynga, we stage the output of the feature generation step to S3 before using MLlib to train and apply machine learning models. The data storage layer to use depends on your cloud platform. For AWS, S3 works well with Spark for distributed data reads and writes.
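As a minimal sketch of this pattern, the snippet below writes a small dataframe to S3 and reads it back in a later step. The bucket name and prefix are placeholders, and the `s3a://` scheme assumes an open-source Hadoop setup; on EMR the `s3://` scheme is typically used instead.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; on a managed cluster such as EMR or
# Databricks this session is usually provided for you.
spark = SparkSession.builder.appName("persist_example").getOrCreate()

# Build a small dataframe to stage.
df = spark.createDataFrame(
    [(1, 0.5), (2, 0.8)],
    ["user_id", "score"],
)

# Write the dataframe to S3 (placeholder bucket and prefix).
df.write.mode("overwrite").parquet("s3a://my-bucket/staging/scores/")

# Read the staged data back in a later step of the workflow.
staged = spark.read.parquet("s3a://my-bucket/staging/scores/")
staged.show()
```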
File formats
When using S3 or other data lakes, Spark supports a variety of file formats for persisting data. ...
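To illustrate, the sketch below persists the same dataframe in a few of the formats Spark supports out of the box. The output paths are placeholders for an S3 bucket or other data lake location.

```python
# Parquet is a common default for analytics workloads: columnar and compressed.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/parquet/")

# CSV is useful for interoperability with other tools; include a header row.
df.write.mode("overwrite").csv("s3a://my-bucket/output/csv/", header=True)

# JSON and ORC are also supported natively by the DataFrame writer.
df.write.mode("overwrite").json("s3a://my-bucket/output/json/")
df.write.mode("overwrite").orc("s3a://my-bucket/output/orc/")
```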