The ML Training Pipeline
Understand what constitutes a good ML training pipeline.
What’s an ML training pipeline?
As the name implies, a pipeline (sometimes also called a framework or a platform) chains together various logically distinct units of functionality, or tasks, to form a single software system. An ML training pipeline is a pipeline for loading data, preparing it for training, and training an ML model on that data. What tasks are chained together in an ML training pipeline?
As an example, ML training always requires data, so one task can be loading the data. Cleaning the loaded data and feature engineering are two more possible tasks in a pipeline. The figure below shows a simple ML training pipeline. In this example, the first task is loading data, which is done in the first block. Data then flows to the second block, in which it's preprocessed. This is followed by feature engineering. At the end of this block, we have data that's ready for training, so the model training block comes next. After the model is trained, we evaluate it, and we finish by creating a training report.
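To make this concrete, here's a minimal sketch of such a linear pipeline in Python. It uses scikit-learn and the Iris dataset purely as stand-ins for the blocks in the figure, and it deliberately simplifies things (for instance, the scaler is fit on the full dataset); it isn't the pipeline we'll build in this course, and the function names are illustrative only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def load_data():
    # Block 1: load the data.
    return load_iris(return_X_y=True)


def preprocess(X):
    # Block 2: data preprocessing (here, just feature scaling).
    return StandardScaler().fit_transform(X)


def engineer_features(X):
    # Block 3: feature engineering (here, a trivial pass-through).
    return X


def train(X, y):
    # Block 4: model training.
    return LogisticRegression(max_iter=200).fit(X, y)


def evaluate(model, X, y):
    # Block 5: evaluation.
    return accuracy_score(y, model.predict(X))


def report(accuracy):
    # Block 6: a minimal training report.
    print(f"Test accuracy: {accuracy:.3f}")


if __name__ == "__main__":
    # The blocks are chained in a strictly linear order, as in the figure.
    X, y = load_data()
    X = engineer_features(preprocess(X))
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = train(X_train, y_train)
    report(evaluate(model, X_test, y_test))
```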
The term pipeline and the example shown above give the impression that the components are always chained together in a linear fashion. This is true for simple cases, but if, for example, we need two datasets to train a model, we might have two branches of data loading, preprocessing, and feature engineering whose results are merged before training.
In the figure below, we see a more sophisticated ML training pipeline. This project requires N datasets, each of which needs to be loaded, preprocessed, and feature engineered. The tasks involved in processing each dataset are different, so we need to have separate blocks (i.e., separate pieces of code) for them. Once all datasets have been processed, we’ll need to merge them before training. We may also need to do additional data processing after merging the data and before training.
In this example, depending on how the system is designed, it might even be possible to run the N branches that load and process data in parallel because they aren’t dependent on each other.
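A rough sketch of this idea follows, assuming three hypothetical datasets and a process_dataset function that stands in for one branch's load, preprocess, and feature-engineering steps. The dataset names and the shared "id" merge key are assumptions made purely for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def process_dataset(name: str) -> pd.DataFrame:
    # Stand-in for one branch: load, preprocess, and feature-engineer one dataset.
    # Here we simply fabricate a tiny frame keyed by a shared "id" column.
    return pd.DataFrame({"id": [1, 2, 3], f"feature_{name}": [0.1, 0.2, 0.3]})


if __name__ == "__main__":
    dataset_names = ["sales", "weather", "holidays"]  # hypothetical datasets

    # The branches don't depend on each other, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        frames = list(pool.map(process_dataset, dataset_names))

    # Merge the processed datasets before training; any additional processing
    # would happen here, followed by the usual training and evaluation blocks.
    merged = frames[0]
    for frame in frames[1:]:
        merged = merged.merge(frame, on="id")

    print(merged)
```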
It's possible to design even more complicated pipelines, such as ones that merge data in arbitrary ways, perform feature selection, perform hyperparameter tuning, train multiple models in parallel, and incorporate deployment steps. In general, computing tasks chained together as shown above form a directed acyclic graph (DAG). A DAG is a graph because it consists of multiple vertices, or tasks, connected by edges.
In the pipelines shown above, each functional block is a vertex, and the connections between them are the edges. They are directed because there’s a clear precedence regarding the order of the vertices. In other words, each block depends on zero or more previous blocks, and the processing in that block can start only after the processing in the previous blocks has been completed. Finally, and most important, they are acyclic, meaning there are no loops in the graph. If you’re familiar with neural network architectures, you may have encountered DAGs before. We’ll discuss them in more detail in a later chapter.
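One common way to represent such a graph in code is a mapping from each task to the tasks it depends on. The sketch below uses Python's standard-library graphlib module to compute a valid execution order for the blocks of the first figure; the task names are just illustrative labels, not part of the pipeline we'll build.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on; the names mirror the
# blocks in the first figure.
dependencies = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "feature_engineering": {"preprocess"},
    "train": {"feature_engineering"},
    "evaluate": {"train"},
    "report": {"evaluate"},
}

# A valid execution order exists only because the graph is acyclic;
# TopologicalSorter raises CycleError if the graph contains a loop.
print(list(TopologicalSorter(dependencies).static_order()))
# ['load_data', 'preprocess', 'feature_engineering', 'train', 'evaluate', 'report']
```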
In this course, we'll stick to the simple pipeline shown in the first figure. We'll build this simple pipeline, and it can easily be extended to the one shown in the second figure or to an even more complex architecture. In a later chapter, we'll discuss how to extend the simple pipeline we've built.
Characteristics of a well-designed pipeline
A well-designed ML training pipeline needs to have the following characteristics:
It must be functional, that is, it must actually do what it was built to do: train a model from the input data.
It must be reliable. Achieving reliability involves conducting unit and system tests on the software.
It must follow widely accepted standards for software design, code organization, code style, and user experience. For example, it should follow the “Don't repeat yourself” (DRY) principle, where repeated patterns are abstracted into reusable units.
The code must be well documented.
The code must be readable. Developers must use meaningful variable names, simple logic, and comments.
The code must be maintainable. For maintainability, the pipeline should be designed in a modular fashion, and it shouldn’t be difficult to add or remove features. It should also be relatively straightforward to scale up the computational capability of the software.
It should be easy to extend the pipeline to multiple projects (e.g., from an image classification project to a cost estimation/regression project), multiple types of models (e.g., from logistic regression to XGBoost), and multiple datasets (e.g., from Iris to MNIST).
It should package data processing functionality and expose it to other developers, such as those who deploy models to production and develop inference code, in the form of a well-designed application programming interface (API). For example, the data scientist can provide the engineer in charge of deploying a model to production with a preprocess_data() method so that they don't have to write their own data preprocessing code, which would violate the DRY principle and create potential for human error (see the sketch after this list).
As much as possible, it should be configurable without having to edit code.
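As an illustration of the last two points, here's a sketch of how preprocessing might be packaged behind a small API and driven by a configuration object rather than code edits. The PipelineConfig and DataProcessor names are hypothetical, not part of any specific library or of the pipeline we'll build.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class PipelineConfig:
    # Settings a user can change without editing pipeline code; in practice
    # these might be loaded from a YAML or JSON file.
    scale_features: bool = True


class DataProcessor:
    """Packages preprocessing so other developers (e.g., the team deploying
    the model) can reuse it instead of rewriting it."""

    def __init__(self, config: PipelineConfig):
        self.config = config
        self._mean = None
        self._std = None

    def fit(self, X: np.ndarray) -> "DataProcessor":
        # Learn the preprocessing parameters from the training data.
        self._mean = X.mean(axis=0)
        self._std = X.std(axis=0)
        return self

    def preprocess_data(self, X: np.ndarray) -> np.ndarray:
        # The exact same method is called at training time and at inference
        # time, so the preprocessing logic is never duplicated (DRY).
        if self.config.scale_features:
            return (X - self._mean) / self._std
        return X


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(100, 4))
    processor = DataProcessor(PipelineConfig()).fit(X_train)
    X_ready = processor.preprocess_data(X_train)  # also usable at inference time
    print(X_ready.mean(axis=0).round(3), X_ready.std(axis=0).round(3))
```

In this sketch, the deployment engineer would import DataProcessor, restore its fitted parameters, and call preprocess_data() on incoming requests instead of reimplementing the preprocessing logic.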
What the pipeline won’t contain
Note that our pipeline won't include certain preparatory steps that are necessary for any ML project. ML projects invariably start with formulating the problem statement, followed by identifying suitable sources of data and performing exploratory data analysis (EDA). These steps aren't part of the ML training pipeline; they're manual processes carried out independently of model training, so they aren't covered in this course.