Productizing PySpark

Tools and techniques for productizing PySpark.

Scheduling

Once you’ve tested a batch model pipeline in a notebook environment, there are a few different ways to schedule it to run on a regular basis.

For example, you may want a churn prediction model for a mobile game to run every morning and publish the scores to an application database. As with the workflow tools we covered in the previous chapter, a PySpark pipeline should have monitoring in place to catch any failures that occur.
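To make the monitoring idea concrete, here is a minimal sketch of a wrapper that runs a scheduled job once and reports any failure. It is an illustration under stated assumptions rather than a prescribed implementation: `run_with_monitoring`, the `job` and `notify` callables, and the `churn_scores` job name are all hypothetical, and the PySpark work itself is left to whatever function is passed in as `job`.

```python
import logging
from typing import Callable

logger = logging.getLogger("pipeline_monitor")


def run_with_monitoring(
    job: Callable[[], None],
    notify: Callable[[str], None],
    job_name: str = "churn_scores",  # hypothetical job name
) -> bool:
    """Run `job` once; on failure, log the error and call `notify`.

    Returns True if the job finished without raising, False otherwise.
    """
    try:
        job()
    except Exception as exc:
        # Record the full traceback for later debugging.
        logger.exception("Job %s failed", job_name)
        # Hand a short message to the alerting hook (email, chat, pager, ...).
        notify(f"{job_name} failed: {exc!r}")
        return False
    logger.info("Job %s succeeded", job_name)
    return True
```

In practice, `job` would be the function that builds the Spark session, scores users, and writes the results to the application database, while `notify` would forward the message to whichever alerting channel your team uses. Returning a boolean lets the calling script exit with a non-zero status on failure, so that the scheduler invoking it can also register the run as failed.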

Techniques

There are a few different approaches for scheduling PySpark jobs to run: ...