Motivation: Why Jupyter Notebooks Aren’t Enough

Learn about Jupyter Notebooks, what they can do, and, more importantly, what they can't do.

Data scientists love Jupyter Notebooks. What’s not to love about them? A Notebook—Jupyter or any other kind—is web-based, so it’s convenient to use. It’s interactive, so development is much easier than writing all our code in .py files and then running and debugging it from the command line. It lets us explore and visualize data and experiment with ML modeling all in one go. Everything seems perfect, right?

What are Jupyter Notebooks good for?

Before we examine the suitability of Notebooks for production, let’s first look at what they do well.

  • Notebooks are perfect for exploratory data analysis (EDA) because the process is largely interactive. We can load our data, examine tabular data, make plots, and experiment with data cleaning, processing, feature engineering, and even ML modeling.

  • We can jump to any cell we want and run cells in any order, which can make iterating on an analysis quick and convenient.

  • Notebooks are great for rapid prototyping of ideas.

  • Notebooks are great for sharing data analysis work. They incorporate both code and Markdown cells, so the analysis logic and code can be documented well. Markdown cells can contain images, which allows for very detailed documentation. In fact, many companies expect interview candidates to submit their solutions to test problems as Jupyter Notebooks.

  • Notebooks are good for creating reports because they can be exported as HTML or PDF files, which, unlike .ipynb files, can easily be shared with a nontechnical audience.

  • Notebooks can be parameterized, so they can be run programmatically using libraries such as Papermill (see the sketch after this list).

  • Notebooks are easy to host on a server, so data scientists don’t necessarily need a local development environment, a powerful local machine, or local data.

  • Notebooks are relatively beginner friendly.
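
For instance, here is a minimal sketch of running a parameterized Notebook programmatically with Papermill. The Notebook names and parameter values below are hypothetical; Papermill looks for a cell tagged `parameters` in the input Notebook and overrides the defaults defined there.

```python
import papermill as pm

# Execute a hypothetical reporting Notebook with injected parameters.
# Papermill overrides the defaults defined in the cell tagged "parameters"
# and saves an executed copy, outputs included, to the second path.
pm.execute_notebook(
    "daily_report.ipynb",
    "daily_report_2024-01-31.ipynb",
    parameters={"run_date": "2024-01-31", "region": "EU"},
)
```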

When are Jupyter Notebooks not the best option?

The very flexibility of Jupyter Notebooks is what makes them unsuitable for production. Some companies do use Notebooks in production, but they're relatively rare, and those that do keep a system of checks and balances in place to make sure that the code contained in Notebooks remains stable. Here are some of the disadvantages of Notebooks for production:

  • Cells can be run in any order, which leaves the door open to human error and makes debugging difficult.

  • It's hard to version-control and compare Notebooks. Jupyter Notebooks, for instance, are stored as JSON, with images typically embedded as base64 strings. While tools such as nbdime help with comparison, figuring out what changed between versions can still be challenging.

  • Collaboration is also difficult because comparing Notebooks is challenging.

  • The nature of Notebooks tempts data scientists to copy code rather than follow software engineering best practices, such as writing modular code, docstrings, and unit tests.

  • Notebooks used in production add the processing overhead of running the Notebook server. Proofs of concept or other simple projects that don’t depend on a lot of data processing may still use Notebooks, but for memory- and compute-intensive tasks, dedicated pipelines work better.

  • Notebooks can leak sensitive data. For example, a data scientist working on scans of sensitive documents may use a Notebook to visualize the data and forget to clear the cell output before checking the Notebook into version control. When that happens, people who aren’t supposed to have access to the data are able to view it.

What’s the alternative?

Software engineering is a decades-old discipline. Traditional software engineering practices, which don’t use notebooks, have evolved to reduce technical debt and increase reliability. Using a dedicated ML training pipeline built on software engineering principles allows data scientists to leverage years of best practices. ML bugs are some of the hardest we’ll ever encounter because they depend on not just code but configuration, hyperparameters, data versions, random seeds, and more. Using a well-designed pipeline helps reduce hard-to-identify bugs that can cause serious issues in production.

That isn’t to say that traditional software engineering tools and practices fit every task. Notebooks are still the best way to implement proofs of concept, quick demos, and data analysis tasks such as EDA; writing software the traditional way would be slower and relatively inflexible for these purposes. For example, a quick demo may not need unit tests, but a project that involves a lot of complex data processing steps will. When we understand the advantages and limitations of each approach, it’s easier to make sound technical decisions.

When we build an ML training pipeline the traditional way, we expect to run it as a command-line program. It’ll be composed of a top-level .py file, multiple additional .py files for component modules, configuration files, and auxiliary files such as those for documentation. In addition to code that relates to functionality, the pipeline will contain code for unit and system testing. We’ll also have a mechanism for packaging some of the code into a library, which can then be published and shared with other developers.
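
As a rough, hedged sketch (the module layout, function names, and config keys below are invented for illustration), the top-level file might look something like this: a plain command-line program that reads a configuration file and delegates to component modules.

```python
# train.py: hypothetical top-level entry point for a training pipeline
import argparse
import json


def load_and_clean(data_path):
    # Placeholder: in a real pipeline, this would live in its own module
    # (e.g., pipeline/data.py) with unit tests alongside it.
    return [], []


def train_model(features, labels, hyperparameters):
    # Placeholder: likewise a separate, tested module (e.g., pipeline/train.py)
    # that fits the model and saves the resulting artifact.
    pass


def main():
    parser = argparse.ArgumentParser(description="Run the ML training pipeline.")
    parser.add_argument("--config", required=True, help="Path to a JSON config file")
    args = parser.parse_args()

    with open(args.config) as f:
        config = json.load(f)

    features, labels = load_and_clean(config["data_path"])
    train_model(features, labels, config["hyperparameters"])


if __name__ == "__main__":
    main()
```

Because it’s an ordinary script, it can be run and scheduled with a single command, for example `python train.py --config config.json`, and tested like any other code.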

A shared library is especially useful because the people who write modeling code (typically data scientists) aren’t always the ones writing inference code (generally ML engineers or software engineers). Publishing the shared code prevents inference developers from copying—or worse, rewriting—the data processing code used in modeling.
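
To make that concrete, here is a hedged sketch of what such a shared module might contain (the package, function, and column names are made up): both the training pipeline and the inference service import the same function instead of keeping their own copies of the logic.

```python
# my_project/features.py: hypothetical shared module, packaged so that
# training and inference code import the exact same preprocessing logic.
import numpy as np
import pandas as pd


def prepare_features(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the same cleaning and feature engineering in training and inference."""
    df = raw.dropna(subset=["amount"]).copy()
    df["log_amount"] = np.log1p(df["amount"].clip(lower=0))
    return df

# Training pipeline:   from my_project.features import prepare_features
# Inference service:   from my_project.features import prepare_features
```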

Notes on transitioning from notebooks

Many of us data scientists don't come from an engineering background, and Jupyter Notebooks have been our go-to platform for coding. As a result, we may find it difficult to transition to an approach that’s more oriented toward software engineering. However, in industry, it pays to have software engineering skills in addition to ML skills because a full-stack data scientist can provide a lot of value to their company.

Here are a few things that can help in this transition, in no particular order.

  • Start using an integrated development environment (IDE), such as Visual Studio Code or even Vim with appropriate plug-ins. A useful feature of Visual Studio Code is that it allows you to edit and run Jupyter Notebooks.

  • Think of the high-level design of the software before writing any code.

  • Even within Notebooks, write code in reusable functions rather than in copied code blocks (see the sketch after this list).

  • Classify reusable functions into logical groups (e.g., data processing, modeling), and move them to separate files or modules.

  • Document code using comments and docstrings.
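
As a small, hedged illustration of the last two points (the column and module names are made up): code that would otherwise be pasted into several cells becomes one function, which can later move out of the Notebook into a module such as `data_processing.py`.

```python
# Before: the same cleaning logic copied into several Notebook cells,
# each copy drifting slightly from the others over time.
#
# After: one reusable function, eventually moved to data_processing.py
# and imported back into the Notebook.
import pandas as pd


def clean_amounts(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with a valid amount and cast the column to float."""
    df = df[df["amount"].notna()].copy()
    df["amount"] = df["amount"].astype(float)
    return df


# In the Notebook:
#     from data_processing import clean_amounts
#     train_df = clean_amounts(train_df)
#     test_df = clean_amounts(test_df)
```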

We’ll see many of these concepts in action as we progress through the course.