Training Data Generation
Let's collect and label training data for the feed ranking ML model.
Your user engagement prediction model’s performance will depend largely on the quality and quantity of the training data. So, let’s see how you can generate training data for your model.
📝 Note that the terms training data row and training example will be used interchangeably.
Training data generation through online user engagement
The users’ online engagement with Tweets can give us positive and negative training examples. For instance, if you are training a single model to predict user engagement, then all the Tweets that received user engagement would be labeled as positive training examples, whereas the Tweets that only have impressions would be labeled as negative training examples.
📝 Impression: If a Tweet is displayed on a user’s Twitter feed, it counts as an impression. The user does not need to read or engage with it; scrolling past it also counts as an impression.
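To make this labeling rule concrete, here is a minimal Python sketch, assuming a simplified impression log; the record format and field names (`user_id`, `tweet_id`, `actions`) are illustrative assumptions, not a real logging schema.

```python
# Minimal sketch: label impressions for a single engagement-prediction model.
# The impression records below are hypothetical; a real pipeline would read
# them from feed/engagement logs.

impressions = [
    {"user_id": 1, "tweet_id": 101, "actions": ["like"]},
    {"user_id": 1, "tweet_id": 102, "actions": []},          # impression only
    {"user_id": 1, "tweet_id": 103, "actions": ["comment"]},
]

def label_single_model(impressions):
    """Any engagement (like, comment, etc.) -> positive (1);
    an impression with no engagement -> negative (0)."""
    rows = []
    for imp in impressions:
        rows.append({
            "user_id": imp["user_id"],
            "tweet_id": imp["tweet_id"],
            "label": 1 if imp["actions"] else 0,
        })
    return rows

print(label_single_model(impressions))
# [{'user_id': 1, 'tweet_id': 101, 'label': 1},
#  {'user_id': 1, 'tweet_id': 102, 'label': 0},
#  {'user_id': 1, 'tweet_id': 103, 'label': 1}]
```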
However, as you saw in the architectural components lesson, you can also train separate models, each predicting the probability of a different user action on a Tweet. The following illustration shows how the same user engagement (as above) can be used to generate training data for separate engagement prediction models.
When you generate data for the “Like” prediction model, all Tweets that the user has liked would be positive examples, and all the Tweets that they did not like would be negative examples.
📝 Note how the Tweet that the user commented on is still a negative example for the “Like” prediction model.
Similarly, for the “Comment” prediction model, all Tweets that the user commented on would be positive examples, and all the ones they did not comment on would be negative examples.
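The same engagement data can be turned into one label column per action, as in the sketch below; again, the record format and the set of actions are illustrative assumptions. Notice that the commented-on Tweet (103) comes out as a negative example for the “Like” model, matching the note above.

```python
# Minimal sketch: derive per-action labels (one per prediction model)
# from the same hypothetical impression records used earlier.

impressions = [
    {"user_id": 1, "tweet_id": 101, "actions": ["like"]},
    {"user_id": 1, "tweet_id": 102, "actions": []},            # impression only
    {"user_id": 1, "tweet_id": 103, "actions": ["comment"]},
]

def label_per_action(impressions, actions=("like", "comment")):
    """For each action, label 1 if that action occurred on the Tweet, else 0."""
    rows = []
    for imp in impressions:
        row = {"user_id": imp["user_id"], "tweet_id": imp["tweet_id"]}
        for action in actions:
            row[f"{action}_label"] = 1 if action in imp["actions"] else 0
        rows.append(row)
    return rows

for row in label_per_action(impressions):
    print(row)
# {'user_id': 1, 'tweet_id': 101, 'like_label': 1, 'comment_label': 0}
# {'user_id': 1, 'tweet_id': 102, 'like_label': 0, 'comment_label': 0}
# {'user_id': 1, 'tweet_id': 103, 'like_label': 0, 'comment_label': 1}
```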