- Persisting Dataframes

Using persistent storages with PySpark.

Persistent storage

Common operations in PySpark are saving a dataframe to persistent storage and reading in a dataset from a storage layer. While PySpark can work with databases such as Redshift, it performs much better when using distributed file stores such as S3 or GCS. In this chapter, we’ll use these types of storage layers as the outputs of model pipelines, but it’s also useful to stage data to S3 as intermediate steps within a workflow.

For example, in the AutoModel system at Zynga, we stage the output of the feature generation step to S3 before using MLlib to train and apply machine learning models. The data storage layer to use depends on your cloud platform. For AWS, S3 works well with Spark for distributed data reads and writes.

File formats

When using S3 or other data lakes, Spark supports a variety of different file formats for persisting data.

Parquet is typically the industry standard when working with Spark, but we’ll also explore Avro and ORC in addition to CSV. Avro is a better format for streaming data pipelines, and ORC is useful when working with legacy data pipelines.

To show the range of data formats supported by Spark, we’ll take the stats dataset and write it to Avro, then Parquet, then ORC, and finally CSV. After performing this round trip of data IO, we’ll end up with our initial Spark dataframe.

Get hands-on with 1300+ tech skills courses.