Reproducibility

Understand the considerations for reproducibility in the pipeline.

Reproducibility is of paramount importance in science, and that’s also true when it comes to data science. A model trained on a given dataset a second time, with exactly the same preprocessing and feature engineering steps and hyperparameters, should perform almost—if not exactly—the same as the first model.

Traditional software programs are deterministic and, in general, will always output the same thing if the input is fixed. But ML systems are stochastic in nature, so this isn’t the case, and it takes some effort to achieve reproducibility. Before we discuss how we can achieve reproducibility in our ML pipeline, let’s discuss the causes of nonreproducibility.

Get hands-on with 1200+ tech skills courses.