Feature Engineering
Let's engineer some features for the Tweet ranking model.
The machine learning model is required to predict user engagement on user A’s Twitter feed. Let’s engineer features to help the model make informed predictions.
📝 The feature set shown here is the result of one brainstorm session. However, feature engineering is an iterative process. As an exercise, you are encouraged to think about more features.
Let’s begin by identifying the four main actors in a Twitter feed:
- The logged-in user
- The Tweet
- The Tweet’s author
- The context
Features for the model
Now it’s time to generate features based on these actors and their interactions. A subset of the features is shown below.
Let’s discuss these features one by one.
Dense features
We will start by discussing the dense features.
User-author features
These features are based on the logged-in user and the Tweet’s author. They capture the social relationship between the user and the author of the Tweet, which is an extremely important factor in ranking the author’s Tweets. For example, if a Tweet is authored by a close friend, a family member, or someone the user is highly influenced by, there is a high chance that the user will want to interact with the Tweet.
How can you capture this relationship in your signals, given that users are not going to specify it explicitly? The following features can help capture it.
User-author historical interactions
When judging the relevance of a Tweet for a user, the relationship between the user and the Tweet’s author plays an important role. If the user has actively engaged with a followee in the past, they are likely to be more interested in seeing a post by that person on their feed.
A few features based on this concept are:
- author_liked_posts_3months
This considers the percentage of an author’s Tweets that are liked by the user in the last three months. For example, if the author created twelve posts in the last three months and the user interacted with six of these posts, then the feature’s value will be:
6/12 = 0.5, or 50%
This feature shows a more recent trend in the relationship between the user and the author.
- author_liked_posts_count_1year
This considers the number of an author’s Tweets that the user interacted with in the last year. This feature shows a more long-term trend in the relationship between the user and the author.
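As a rough illustration, the two historical-interaction features above could be computed along these lines. This is only a sketch under stated assumptions: the input layout (a list of `(tweet_id, created_at)` pairs for the author and a set of Tweet IDs the user engaged with), the function signatures, and the 90-day and 365-day windows are all hypothetical choices, not part of any real Twitter pipeline.

```python
from datetime import datetime, timedelta


def author_liked_posts_3months(author_posts, liked_ids, now, window_days=90):
    """Fraction of the author's Tweets inside the window that the user liked.

    author_posts: list of (tweet_id, created_at) pairs (assumed layout).
    liked_ids: set of tweet_ids the user liked (assumed layout).
    """
    cutoff = now - timedelta(days=window_days)
    recent = [tid for tid, created_at in author_posts if created_at >= cutoff]
    if not recent:
        # No recent Tweets by the author: fall back to 0.0 rather than divide by zero.
        return 0.0
    return sum(tid in liked_ids for tid in recent) / len(recent)


def author_liked_posts_count_1year(author_posts, interacted_ids, now, window_days=365):
    """Number of the author's Tweets inside the window that the user interacted with."""
    cutoff = now - timedelta(days=window_days)
    return sum(
        1
        for tid, created_at in author_posts
        if created_at >= cutoff and tid in interacted_ids
    )


# Worked example mirroring the text: twelve Tweets in the last three months,
# six of which the user engaged with, plus one older Tweet from about 200 days ago.
now = datetime(2024, 1, 1)
author_posts = [(i, now - timedelta(days=7 * (i + 1))) for i in range(12)]
author_posts.append((100, now - timedelta(days=200)))
engaged_ids = {0, 1, 2, 3, 4, 5, 100}

ratio_3months = author_liked_posts_3months(author_posts, engaged_ids, now)
count_1year = author_liked_posts_count_1year(author_posts, engaged_ids, now)
print(ratio_3months, count_1year)
```

The older Tweet falls outside the three-month window, so it affects only the one-year count. This is how the two features can diverge and expose recent versus long-term trends.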
📝 Ideally, we should normalize the above features by the total number of Tweets that the user interacted with during these periods. This enables the model to see the real picture by cancelling out the effect of a user’s general interaction habits. For instance, let’s say user A generally tends to interact (e.g., like or comment) more while user B does not. Now, both user A and B have a hundred interactions on user C’s posts. User B’s interaction is more significant since they generally interact less. On the other hand, user A’s interaction is mostly a result of their tendency to interact more.
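The normalization described in the note can be sketched as follows. The totals used for user A and user B (1,000 and 200 overall interactions) are made-up numbers chosen only to illustrate the effect, and the function name is hypothetical.

```python
def normalized_author_interaction(interactions_with_author, total_user_interactions):
    """Share of the user's total interactions that went to this author's Tweets."""
    if total_user_interactions == 0:
        # A user with no interactions at all carries no signal for any author.
        return 0.0
    return interactions_with_author / total_user_interactions


# Both users interacted with user C's Tweets 100 times in the same period.
# Assumed totals: user A interacts a lot in general, user B does not.
user_a_score = normalized_author_interaction(100, 1000)
user_b_score = normalized_author_interaction(100, 200)
print(user_a_score, user_b_score)
```

Under these assumed totals, user B's normalized value is higher than user A's even though the raw counts are identical, which matches the intuition that B's interactions with C are the more significant signal.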
User-author similarity
Another important feature set for predicting user engagement measures how similar the logged-in user and the Tweet’s author are. A few ways to compute such features include:
- common_followees
This is a simple feature that can show the similarity between the user and the author. For a user-author pair, ...