Reproducibility

Understand the considerations for reproducibility in the pipeline.

Reproducibility is of paramount importance in science, and data science is no exception. A model trained a second time on the same dataset, with exactly the same preprocessing, feature engineering steps, and hyperparameters, should perform almost—if not exactly—the same as the first model.

Traditional software programs are deterministic and, in general, will always output the same thing if the input is fixed. But ML systems are stochastic in nature, so this isn’t the case, and it takes some effort to achieve reproducibility. Before we discuss how we can achieve reproducibility in our ML pipeline, let’s discuss the causes of nonreproducibility.
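As a minimal sketch of this idea (illustrative only, not the lesson's pipeline code), fixing the seed of a random number generator makes a stochastic step, here a stand-in for random weight initialization, produce identical results across runs. The function name and seed value are hypothetical.

```python
import random

SEED = 42  # hypothetical fixed seed shared across runs


def noisy_initialization(n: int) -> list:
    """Stand-in for random weight initialization in a model."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Two "runs" with the same seed yield identical random draws.
random.seed(SEED)
run1 = noisy_initialization(5)

random.seed(SEED)
run2 = noisy_initialization(5)

assert run1 == run2  # with the same seed, the randomness is reproducible
```

In a real pipeline, each library with its own random state (e.g., NumPy or a deep learning framework) must be seeded separately for the runs to match.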


Causes of nonreproducibility

The following are some of the places in an ML training pipeline where changes or randomness can affect the reproducibility of model output.

  • Input data: If the input data is changed, accidentally or otherwise, the output of the pipeline will be different. In our pipeline, we specify the input data in a config file, so it's important to ensure that this file is unchanged across runs.

  • Pipeline tasks: If pipeline tasks are different across runs, the output ...