Training Data Collection Strategies

Learn the training data collection strategies for the machine learning systems you are going to build.

We'll cover the following...

Significance of training data
Collection techniques
- User’s interaction with pre-existing system (online)
- Human labelers (offline)
Additional creative collection techniques
Train, test, & validation splits
Quantity of training data
Training data filtering

Significance of training data

A machine learning system consists of three main components: the training algorithm (e.g., a neural network or decision trees), the training data, and the features. The training data is of paramount importance. The model learns directly from the data to create and refine its rules for a given task. Therefore, inadequate, inaccurate, or irrelevant data will render even the most performant algorithms useless. The quality and quantity of training data are a big factor in determining how far you can go in your machine learning optimization task.

Collection techniques

We will begin by looking at the training data collection techniques.

User’s interaction with pre-existing system (online)

In some cases, the user’s interaction with the pre-existing system can generate good-quality training data.

📝 We will refer to this technique as online data collection in the course.

In many cases, the early version built to solve a relevance or ranking problem is a rule-based system. With the rule-based system in place, you build an ML system for the task (which is then iteratively improved). So, when you build the ML system, you can use the user’s interaction with the in-place/pre-existing system to generate training data for the model. Let’s get a better idea with an example: building a movie recommender system.

Assume that you are asked to recommend movies to a Netflix user. You will be training a model that predicts which movies the user is more likely to enjoy or view. You need to collect positive examples (cases where the user liked a particular movie) as well as negative examples (cases where the user didn’t like a particular movie) to feed into your model.

Here, the consumer of the system (the Netflix user) can generate training data through their interactions with the current system that is being used for recommending movies.

The early version of the movie recommender might be popularity-based, localization-based, rating-based, a hand-crafted model, or ML-based. The important point here is that we can get training data from the user’s interaction with this system. If a user likes or watches a recommended movie, it counts as a positive training example; if a user dislikes or ignores a recommended movie, it counts as a negative training example.
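The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not how any real recommender logs its data: the `Interaction` record, the action names (`watched`, `liked`, `ignored`, `disliked`, `hovered`), and the `label_interactions` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass

# Hypothetical action names; a real logging schema will differ.
POSITIVE_ACTIONS = {"watched", "liked"}
NEGATIVE_ACTIONS = {"ignored", "disliked"}


@dataclass(frozen=True)
class Interaction:
    """One logged user response to a recommended movie (assumed schema)."""
    user_id: str
    movie_id: str
    action: str


def label_interactions(interactions):
    """Turn logged interactions into (user_id, movie_id, label) examples.

    Label 1 marks a positive example and 0 a negative example.
    """
    examples = []
    for it in interactions:
        if it.action in POSITIVE_ACTIONS:
            examples.append((it.user_id, it.movie_id, 1))
        elif it.action in NEGATIVE_ACTIONS:
            examples.append((it.user_id, it.movie_id, 0))
        # Actions in neither set carry no clear signal and are skipped.
    return examples


# A tiny made-up interaction log from the pre-existing recommender.
log = [
    Interaction("u1", "m1", "watched"),
    Interaction("u1", "m2", "ignored"),
    Interaction("u2", "m1", "disliked"),
    Interaction("u2", "m3", "hovered"),
]
examples = label_interactions(log)
print(examples)
```

In this sketch, interactions with an ambiguous action are dropped rather than forced into either class; whether that is the right choice depends on the system being built.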

📝 The above discussion was one example, but many machine learning systems utilize the current system in place for the generation of training data.

We will discuss strategies for generating training data from the current system in multiple problems in this course, such as search ranking, ad relevance, and recommendation systems.
