Sklearn Workflow
Building model pipelines using sklearn
Batch model pipeline workflow
A common workflow for batch model pipelines is to extract data from a data lake or data warehouse, train a model on historical user behavior, predict future behavior for more recent data, and then save the results to a data warehouse or application database.
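A minimal sketch of this workflow is shown below, using sklearn for the train and predict steps. The file paths, table columns, and feature names are hypothetical; in practice the extract and load steps would typically read from and write to a data warehouse rather than local files.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Extract: historical data with known labels, and recent data to score.
historical = pd.read_csv("warehouse_export/historical_user_behavior.csv")
recent = pd.read_csv("warehouse_export/recent_user_behavior.csv")

# Hypothetical feature and label columns.
feature_cols = ["sessions", "total_spend", "days_active"]

# Train: fit a model on historical user behavior.
model = LogisticRegression()
model.fit(historical[feature_cols], historical["purchased"])

# Predict: score the more recent data with purchase propensities.
recent["propensity"] = model.predict_proba(recent[feature_cols])[:, 1]

# Load: save the results for a data warehouse or application database.
recent[["user_id", "propensity"]].to_csv(
    "scores/user_propensity.csv", index=False
)
```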
In the gaming industry, I’ve seen this workflow used for building likelihood-to-purchase and churn models, where game servers use the predictions to provide different treatments to users. Usually, libraries like sklearn are used to develop models, and frameworks such as PySpark are used to scale up to the full player base.
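One way to do this scale-up, sketched below, is a scalar pandas UDF that applies the fitted sklearn model to batches of rows across a Spark cluster. This assumes the `model` and `feature_cols` objects from the previous sketch, and the `player_features` and `player_scores` table names are hypothetical.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("score_players").getOrCreate()
players = spark.table("player_features")  # hypothetical full player base

@pandas_udf("double")
def score(sessions: pd.Series, total_spend: pd.Series,
          days_active: pd.Series) -> pd.Series:
    # Spark ships the fitted sklearn model to each executor and applies
    # it to one batch of rows at a time.
    features = pd.concat([sessions, total_spend, days_active], axis=1)
    features.columns = feature_cols
    return pd.Series(model.predict_proba(features)[:, 1])

scored = players.withColumn(
    "propensity", score("sessions", "total_spend", "days_active")
)
scored.select("user_id", "propensity") \
    .write.mode("overwrite").saveAsTable("player_scores")
```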
Model pipelines
It is typical for model pipelines to require other ETLs to run in a data platform before the pipeline can run on the most recent data. For example, there may be an upstream step in the data platform that translates JSON strings into schematized events that are used as input for a model. In this situation, it might be necessary to rerun the model pipeline for any day on which the JSON transformation process had issues.
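One way to support this is to parameterize the pipeline by date, so that a single day's partition can be reprocessed after the upstream step is fixed. The sketch below assumes the `model` and `feature_cols` objects from the first example, and the partitioned table paths are hypothetical.

```python
from datetime import date, timedelta

import pandas as pd

def run_pipeline(run_date: date) -> None:
    # Read only the partition of schematized events for run_date, so a
    # problem day can be rerun independently of the daily schedule.
    events = pd.read_parquet(f"warehouse/schematized_events/dt={run_date}")
    events["propensity"] = model.predict_proba(events[feature_cols])[:, 1]
    events[["user_id", "propensity"]].to_parquet(
        f"warehouse/model_scores/dt={run_date}.parquet", index=False
    )

# Normal daily run scores yesterday's partition.
run_pipeline(date.today() - timedelta(days=1))

# Backfill: rerun for a day when the JSON transformation had issues.
run_pipeline(date(2024, 3, 14))
```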