To recap our course goals: this entire course is a single project. We start by designing the ML pipeline and add components to it over the duration of the course. Then, as a concrete example of how to use the pipeline, we create an ML classification project.

In this chapter, we start with the architecture, or design, of our software. Design is the first step in developing any software. Typically, this means determining the scope of the project, identifying the components of the system, and drawing a block diagram that shows how the parts fit together. It also includes designing what goes inside each block and the interfaces where the blocks connect. For our pipeline, the guiding question is: What are the logically distinct functionalities in training a model?

Components of the pipeline

A pipeline contains the following components:

  • Loading data

  • Preprocessing data

  • Engineering features

  • Merging data

  • Training the model

  • Evaluating the model

  • Generating the training report
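One way to picture how these components fit together is as a sequence of stages that each receive the output of the previous one. The following is a minimal sketch under that assumption, not the course's actual implementation. The stage names, the `State` dictionary, and the `make_stage` and `run_pipeline` helpers are all hypothetical and serve only to illustrate the block-diagram idea.

```python
from typing import Callable, Dict, List

# Hypothetical shared state passed from one stage to the next.
State = Dict[str, object]
Stage = Callable[[State], State]

# Hypothetical stage names mirroring the component list above.
STAGE_NAMES = [
    "load_data",
    "preprocess_data",
    "engineer_features",
    "merge_data",
    "train_model",
    "evaluate_model",
    "generate_report",
]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that only records that it ran."""

    def stage(state: State) -> State:
        history = list(state.get("history", []))
        history.append(name)
        # Return a new dictionary so each stage leaves its input untouched.
        return {**state, "history": history}

    return stage


def run_pipeline(stages: List[Stage], state: State) -> State:
    """Run the stages in order, feeding each one the previous result."""
    for stage in stages:
        state = stage(state)
    return state


pipeline = [make_stage(name) for name in STAGE_NAMES]
result = run_pipeline(pipeline, {})
print(result["history"])
```

Because every stage shares the same interface (state in, state out), a placeholder can later be swapped for a real implementation without changing `run_pipeline`.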

Loading data

The first component, of course, has to do with data. We need to be able to load data from a source. The source can be disk storage or a cloud storage location. Data may be streamed in or even come directly from peripheral hardware such as cameras or other sensors.

To keep things simple, our source will be the local disk. We’ll also support streaming data in, so we can demonstrate how the same code can be used for inference. We also need to decide what kind of data our pipeline will support. Does it support tabular data? What about images or other kinds of binary data? What about text or documents? Engineering a system that works with all kinds of data is much harder than engineering one with a narrower scope. For the sake of simplicity, we’ll stick to tabular, or structured, data.
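As a rough illustration of these two choices (local disk as the source, tabular data as the format), a loader might look something like the sketch below. It assumes the data is stored as a CSV file with a header row, which the text does not specify. The function names `stream_rows` and `load_table` are hypothetical. It uses only Python's standard `csv` module and leaves every value as a string.

```python
import csv
from pathlib import Path
from typing import Dict, Iterator, List

# A row of tabular data: column name mapped to the raw string value.
Row = Dict[str, str]


def stream_rows(path: Path) -> Iterator[Row]:
    """Yield one row at a time so the whole file is never held in memory."""
    with path.open(newline="") as handle:
        # DictReader treats the first line of the file as the column names.
        yield from csv.DictReader(handle)


def load_table(path: Path) -> List[Row]:
    """Load the entire file at once, which is convenient for training."""
    return list(stream_rows(path))
```

Under these assumptions, `load_table(Path("data.csv"))` returns every row for training, while iterating over `stream_rows(Path("data.csv"))` handles one record at a time, which is closer to how data arrives during inference. The file name here is only a placeholder.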
