Persisting DataFrames
Using persistent storage with PySpark.
Persistent storage
Saving a dataframe to persistent storage and reading a dataset back in from a storage layer are common operations in PySpark. While PySpark can work with databases such as Redshift, it performs much better with distributed file stores such as S3 or GCS. In this chapter, we'll use these types of storage layers as the outputs of model pipelines, but it's also useful to stage data to S3 as an intermediate step within a workflow.
For example, in the AutoModel system at Zynga, we stage the output of the feature generation step to S3 before using MLlib to train and apply machine learning models. The data storage layer to use depends on your cloud platform. For AWS, S3 works well with Spark for distributed data reads and writes.
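As a minimal sketch of this pattern, the snippet below writes a small dataframe to S3 and reads it back in a later step. The bucket name and prefix are placeholders, and the `s3a://` scheme assumes an open-source Hadoop setup; on EMR the `s3://` scheme is typically used instead.

```python
from pyspark.sql import SparkSession

# Create (or reuse) a Spark session; on a managed cluster such as EMR or
# Databricks this session is usually provided for you.
spark = SparkSession.builder.appName("persist_example").getOrCreate()

# Build a small dataframe to stage.
df = spark.createDataFrame(
    [(1, 0.5), (2, 0.8)],
    ["user_id", "score"],
)

# Write the dataframe to S3 (placeholder bucket and prefix).
df.write.mode("overwrite").parquet("s3a://my-bucket/staging/scores/")

# Read the staged data back in a later step of the workflow.
staged = spark.read.parquet("s3a://my-bucket/staging/scores/")
staged.show()
```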
File formats
When using S3 or other data lakes, Spark supports a variety of file formats for persisting data. ...
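To illustrate, the sketch below persists the same dataframe in a few of the formats Spark supports out of the box. The output paths are placeholders for an S3 bucket or other data lake location.

```python
# Parquet is a common default for analytics workloads: columnar and compressed.
df.write.mode("overwrite").parquet("s3a://my-bucket/output/parquet/")

# CSV is useful for interoperability with other tools; include a header row.
df.write.mode("overwrite").csv("s3a://my-bucket/output/csv/", header=True)

# JSON and ORC are also supported natively by the DataFrame writer.
df.write.mode("overwrite").json("s3a://my-bucket/output/json/")
df.write.mode("overwrite").orc("s3a://my-bucket/output/orc/")
```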