Reproducibility

Understand the considerations for reproducibility in the pipeline.

Reproducibility is of paramount importance in science, and data science is no exception. A model trained a second time on the same dataset, with exactly the same preprocessing, feature engineering steps, and hyperparameters, should perform almost—if not exactly—the same as the first model.

Traditional software programs are deterministic and, in general, will always output the same thing if the input is fixed. But ML systems are stochastic in nature, so this isn’t the case, and it takes some effort to achieve reproducibility. Before we discuss how we can achieve reproducibility in our ML pipeline, let’s discuss the causes of nonreproducibility.
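As a minimal sketch of this idea (illustrative only, not the lesson's pipeline code), fixing the seed of a random number generator makes a stochastic step, here a stand-in for random weight initialization, produce identical results across runs. The function name and seed value are hypothetical.

```python
import random

SEED = 42  # hypothetical fixed seed shared across runs


def noisy_initialization(n: int) -> list:
    """Stand-in for random weight initialization in a model."""
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Two "runs" with the same seed yield identical random draws.
random.seed(SEED)
run1 = noisy_initialization(5)

random.seed(SEED)
run2 = noisy_initialization(5)

assert run1 == run2  # with the same seed, the randomness is reproducible
```

In a real pipeline, each library with its own random state (e.g., NumPy or a deep learning framework) must be seeded separately for the runs to match.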


Causes of nonreproducibility

The following are some of the places in an ML training pipeline where changes or randomness can affect the reproducibility of model output.

  • Input data: If the input data is changed, accidentally or otherwise, the output of the pipeline will be different. In our pipeline, we specify the input data in a config file, so it's important to ensure that this file is unchanged across runs.

  • Pipeline tasks: If pipeline tasks are different across runs, the output ...