Course Goals and Structure

Get an overview of the goals, structure, strengths, and intended audience for this course.

We'll cover the following...

Intended audience
Course goals
What this course doesn’t teach
Course structure
Course strengths

Welcome to Building a Machine Learning Pipeline from Scratch!

Data science, and machine learning (ML) modeling in particular, has evolved over the past decade from a purely exploratory endeavor that only large Silicon Valley-type companies could afford into a mainstream activity practiced by most technologically inclined companies. Jobs and job requirements have evolved in the same way.

In the early 2010s, PhDs hired right out of school would apply their scientific skills to solving data science problems. The only skills companies looked for were knowledge of ML and the ability to apply the scientific method to business problems. Data science jobs didn’t require deep software engineering knowledge.

Since then, the landscape has changed significantly. Even fresh PhDs applying for data scientist jobs are expected to know a fair amount of software engineering. In practice, however, the supply of people who know both data science and software engineering continues to fall short of demand.

An early-career data scientist joining a company rarely dives straight into ML. In most situations, the data scientist first has to build the infrastructure needed to train their models. This may be less of an issue in large Silicon Valley-type companies, which have many ML engineers on the payroll who can build infrastructure for data scientists. However, in startups and in companies that aren’t early adopters of data science, which is most companies, ML engineers may not be available to help data scientists build the required infrastructure. In these companies, new hires are expected to take on the role of full-stack data scientists: they not only handle the science part of data science but also perform the engineering duties it entails.

Intended audience

This course is designed for people who have:

- A basic familiarity with ML
  - What ML is and the steps involved in training and evaluating a model
  - What it means to use a model for inference
- Some prior experience writing code in Python
- Basic familiarity with the command line
- Basic familiarity with Git

In other words, if you’re a person with ML knowledge joining the workforce or an early-career data scientist who wants to become better at the engineering aspects of your trade, this course is for you.

Note that this course has been developed for the Linux operating system. Windows users may need to make some minor adjustments to get things working.

Course goals

By the end of the course, you’ll be able to:

- Design a great ML training pipeline.
  - Know the major components of a pipeline.
  - Structure the components into a cohesive whole.
- Build an ML training pipeline entirely from scratch using software engineering best practices.
  - Create a pipeline that’s fully functional, from loading data to creating reports and preparing for model deployment.
  - Build a pipeline that can work with new datasets and model types.
- Use advanced features and libraries in Python.

What this course doesn’t teach

This course doesn’t teach ML concepts or basic programming; you should already know your way around those topics before you begin.

Course structure

The entire course is a single project. We’ll start by designing the ML pipeline and then add components to it one by one. To see a concrete example of how the pipeline works, we’ll use it in an ML classification project. Along the way, we’ll take brief detours to introduce new concepts and explore new Python libraries before incorporating them into the pipeline. We’ll spend most of the course building the ML training pipeline and the remainder discussing how to deploy the trained model to production and how to extend the pipeline to new problems.

The course consists of the following chapters:

Introduction: Get a quick overview of the course.
Getting Started: Learn why it’s important to move from Jupyter Notebooks to software-engineered pipeline scripts, and what an ML training pipeline is.
Structuring the ML Pipeline: Design your ML pipeline and learn about optimal and standard ways to organize directories and files, code style, and dependency management.
Directed Acyclic Graphs (DAGs): Explore DAGs, their utility in data and ML pipelines, and how to sort them topologically.
The ML Library: Begin building the pipeline, starting with its component modules for processing data and training a model for iris classification. Learn how to use some advanced Python features and useful libraries.
The Pipeline Core: Write the top-level pipeline script and learn about command-line argument parsing, logging, and documentation.
Extending the Pipeline: Extend the pipeline to work with the Auto MPG dataset.
Testing: Discover unit and system testing.
Deployment: Package your pipeline library for distribution and use it to perform inference.
Other Considerations: Explore data quality and performance monitoring and the reproducibility of your results. Take a brief look at some off-the-shelf ML training frameworks.
Wrapping Up: Conclude the course and take a look at some useful resources.
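
To give a flavor of the DAG chapter, here is a minimal sketch of how the steps of a training pipeline might be modeled as a DAG and sorted topologically, so that each step runs only after the steps it depends on. The step names and the `execution_order` helper are hypothetical and are not taken from the course code; the sketch assumes Python 3.9 or later, where the standard library's `graphlib` module is available.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline steps (illustrative names, not the course's own).
# Each key is a step; its value is the set of steps it depends on.
PIPELINE_DAG = {
    "load_data": set(),
    "preprocess": {"load_data"},
    "train_model": {"preprocess"},
    "evaluate": {"preprocess", "train_model"},
    "report": {"evaluate"},
}


def execution_order(dag):
    """Return the steps in an order that respects every dependency.

    TopologicalSorter raises graphlib.CycleError if the graph has a cycle,
    in which case it is not a DAG and no valid order exists.
    """
    return list(TopologicalSorter(dag).static_order())


if __name__ == "__main__":
    for step in execution_order(PIPELINE_DAG):
        print(step)
```

Declaring dependencies explicitly, rather than hard-coding the call order, means a new step can be added by listing what it depends on and letting the sort work out where it belongs.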

Course strengths

If you're looking to build skills that will help you early in your career as a data scientist, this course offers the following benefits:

Course Benefits

Topic	Description
Data science in the industry	Learning how data science is done in the industry is essential to getting and succeeding in an early-career role as a data scientist.
ML training pipelines	The ability to understand and build ML training pipelines is crucial for large-scale modeling and deployment.
Model deployment	Knowing how to package data processing code for inference helps models get deployed in a timely manner and reduces training-serving skew.
Software engineering	You need to know how to code for ML modeling and deployment, so it’s important to learn software engineering best practices.
Advanced Python features	Advanced Python features can help you write well-designed code.
Python libraries	Knowing how to use the Python libraries introduced in this course is useful for not just ML but any kind of software development.
