Training Data Collection Strategies
Learn the training data collection strategies for the machine learning systems you are going to build.
Significance of training data
A machine learning system consists of three main components: the training algorithm (e.g., neural networks or decision trees), the training data, and the features. The training data is of paramount importance because the model learns directly from the data to create and refine its rules for a given task. Inadequate, inaccurate, or irrelevant data will therefore render even the most performant algorithm useless. The quality and quantity of training data largely determine how far you can go in your machine learning optimization task.
A lot of the progress in machine learning - and this is an unpopular opinion in academia - is driven by an increase in both computing power and data. An analogy is to build a space rocket: You need a huge rocket engine, and you need a lot of fuel. - Andrew Ng
Most real-world problems are supervised learning problems, which require labeled training data. You therefore need to think strategically about how to collect the labeled data that you feed into your learning system.
Let’s explore strategies that will help in collecting labeled training data for our learning tasks.
Collection techniques
We will begin by looking at the training data collection techniques.
User’s interaction with pre-existing system (online)
In some cases, the user’s interaction with the pre-existing system can generate good quality training data.
📝 We will refer to this technique as online data collection in the course.
In many cases, the early version of a system built to solve a relevance or ranking problem is rule-based. With the rule-based system in place, you then build an ML system for the same task and improve it iteratively. When you build the ML system, you can use the users' interactions with the pre-existing system to generate training data for the model. Let's get a better idea with an example: building a movie recommender system.
Assume that you are asked to recommend movies to a Netflix user. You will train a model that predicts which movies the user is most likely to view or enjoy. To feed the model, you need to collect positive examples (cases where the user liked a particular movie) as well as negative examples (cases where the user didn't like a particular movie).
Here, the consumer of the system (the Netflix user) can generate training data through their interactions with the current system that is being used for recommending movies.
The early version of the movie recommendation system might be popularity-based, localization-based, rating-based, a hand-crafted model, or ML-based. The important point is that we can get training data from the user's interaction with this system. If a user likes or watches a recommended movie, it counts as a positive training example. If a user dislikes or ignores a recommended movie, it counts as a negative training example.
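The labeling rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the way any particular recommender is implemented: the `Interaction` record, the action names ("watched", "liked", "disliked", "ignored", "hovered"), and the `label_interactions` helper are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    """One logged user action on a recommended movie (hypothetical schema)."""
    user_id: str
    movie_id: str
    action: str  # assumed values: "watched", "liked", "disliked", "ignored"


# Assumed mapping from logged actions to labels, following the rule in the text.
POSITIVE_ACTIONS = {"watched", "liked"}
NEGATIVE_ACTIONS = {"disliked", "ignored"}


def label_interactions(interactions):
    """Turn logged interactions into (user_id, movie_id, label) training examples.

    Label 1 marks a positive example and 0 a negative one. Actions that fall
    in neither set are skipped instead of being guessed.
    """
    examples = []
    for item in interactions:
        if item.action in POSITIVE_ACTIONS:
            label = 1
        elif item.action in NEGATIVE_ACTIONS:
            label = 0
        else:
            continue  # unknown signal: leave it out of the training set
        examples.append((item.user_id, item.movie_id, label))
    return examples


# Made-up log entries, for illustration only.
log = [
    Interaction("u1", "m1", "watched"),
    Interaction("u1", "m2", "ignored"),
    Interaction("u2", "m1", "liked"),
    Interaction("u2", "m3", "hovered"),  # not a recognized action, so skipped
]

training_examples = label_interactions(log)
for example in training_examples:
    print(example)
```

Skipping unrecognized actions is a design choice made for this sketch. A real system would have to decide explicitly how to treat ambiguous signals.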
📝 The above discussion was one example, but many machine learning systems utilize the current system in place for the generation of training data.
We will discuss how to generate training data from the current system for several problems in this course, such as search ranking, ads relevance, and recommendation systems.
Human labelers (offline)
In other cases, the user of the system ...