Gradient boosting, or Gradient Boosted Decision Trees (GBDT), is a powerful machine learning algorithm built on top of decision trees. In fact, it is an ensemble of decision trees.
In machine learning, we often combine several smaller models to make more accurate predictions. Such a combination is called an ensemble, and there are three main types of ensemble methods, given below:
Bagging: In this method, we train weak estimators (models) independently and combine their outputs to get a better estimate. In the case of a regression problem, we usually take the mean of the estimates; in the case of a classification problem, we take a vote among the estimators.
Boosting: In this method, the estimators are built sequentially. Each estimator depends on the previous ones and tries to correct their errors.
Stacking: In this method, multiple estimators of different types are trained, and a separate model (a meta-learner) learns how best to combine their predictions.
Note: The random forest is a bagging ensemble of decision trees.
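As a quick sketch of how these three methods look in practice, the snippet below runs scikit-learn's ready-made ensembles on a synthetic regression dataset. The dataset, the chosen base models, and all parameters here are illustrative assumptions, not part of any fixed recipe.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    RandomForestRegressor,      # bagging ensemble of decision trees
    GradientBoostingRegressor,  # boosting ensemble of decision trees
    StackingRegressor,          # stacking of different model types
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data used purely for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

models = {
    "Bagging (random forest)": RandomForestRegressor(random_state=0),
    "Boosting (GBDT)": GradientBoostingRegressor(random_state=0),
    "Stacking": StackingRegressor(
        estimators=[("tree", DecisionTreeRegressor(max_depth=3)),
                    ("ridge", Ridge())],
        final_estimator=Ridge(),
    ),
}

for name, model in models.items():
    # 5-fold cross-validated R^2 for each ensemble.
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: R^2 = {score:.3f}")
```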
Gradient boosting works well for both regression and classification tasks. A major difference between gradient boosting and the random forest is that the random forest builds its estimators independently of one another (in parallel), while gradient boosting adds new estimators to the ensemble sequentially.
Gradient boosting works in the following steps:
Step 1: We start with initial predictions made by a very simple first estimator. When solving a regression problem, we take the first prediction for every data point to be the average of all the target values. For a classification problem, we take the first prediction to be the log of the odds of the target class.
Step 2: Calculate the loss on these predictions using a loss function, such as Mean Squared Error (MSE) or Mean Absolute Error (MAE). To learn more about different loss functions, refer to this link.
Step 3: We build a new decision tree using these predictions. This tree is trained on the original dataset to learn the residuals, that is, the differences between the actual values and the current predictions.
Step 4: Add this new tree to the ensemble. By that, we mean that the next time we make a prediction, we'll use the first estimator in combination with the new decision tree. How we combine the two depends on whether we are performing regression or classification; we'll learn more about the details in the next subsection.
Step 5: Repeat from step 2 until the defined limit on the number of decision trees has been reached, or until adding new trees no longer improves the predictions.
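The steps above can be written down in a few lines of code. The following is a minimal from-scratch sketch of gradient boosting for regression, assuming squared-error loss, a synthetic dataset, and scikit-learn's DecisionTreeRegressor for the individual trees; the tree depth, learning rate, and number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=200)

n_trees, learning_rate = 100, 0.1

# Step 1: the initial prediction for every point is the mean of the targets.
predictions = np.full(len(y), y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: for squared-error loss, the negative gradient is the residual.
    residuals = y - predictions

    # Step 3: fit a small tree to the residuals.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)

    # Step 4: add the new tree's scaled output to the running predictions.
    predictions += learning_rate * tree.predict(X)
    trees.append(tree)

# Step 5 corresponds to the loop limit above (n_trees).
# Predicting for a new point repeats the same combination:
X_new = np.array([[2.5]])
y_new = y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
print(y_new)
```

Note that, for squared-error loss, the negative gradient of the loss is exactly the residual, which is why fitting trees to residuals amounts to gradient descent in function space.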
The diagram below illustrates the workflow:
The figure below shows more concretely how the initial predictions are combined with the new models:
Notice that we multiply the predictions from the new trees by a learning rate. This scales down each tree's contribution so that the ensemble learns in a more controlled, gradual way.
Let's recall step 1: our first prediction for every value is the average of the target values, and each subsequent tree is trained to predict the residuals. In the case of a regression problem, we simply add the (scaled) predictions of the new decision trees to the initial prediction.
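If you use a library implementation instead of writing the loop yourself, this knob is typically exposed directly; for example, scikit-learn's GradientBoostingRegressor has a learning_rate parameter. The sketch below (on an arbitrary synthetic dataset) simply fits the model with a few different learning rates to show the effect:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data used purely for illustration.
X, y = make_regression(n_samples=1000, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A smaller learning rate shrinks each tree's contribution and usually
# needs more trees to reach the same fit.
for lr in (1.0, 0.1, 0.01):
    gbdt = GradientBoostingRegressor(learning_rate=lr, n_estimators=200,
                                     random_state=0)
    gbdt.fit(X_train, y_train)
    print(f"learning_rate={lr}: test R^2 = {gbdt.score(X_test, y_test):.3f}")
```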
Making predictions for a classification problem is trickier because we took the log of the odds as our first prediction for every data point. To calculate residuals, we first convert these log-odds predictions to probabilities using the following formula:
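$$
p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}
$$

Here, $p$ is the probability corresponding to the current log-odds prediction.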
The new trees, however, are built from these probabilities: their leaves hold values derived from the probability residuals. This is where a problem arises. Our running predictions are in terms of log-odds, while the leaf values come from probabilities. To bring the leaf values back into log-odds space, we apply a transformation. The most common transformation is given below:
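$$
\text{leaf output} = \frac{\sum_{i \in \text{leaf}} r_i}{\sum_{i \in \text{leaf}} p_i\,(1 - p_i)}
$$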
In this equation, the subscript $i$ runs over the samples that fall into a given leaf: $r_i$ is the residual of the $i$-th sample, and $p_i$ is the probability predicted for it by the previous iteration of the ensemble.
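To make these classification steps concrete, here is a minimal from-scratch sketch of one possible implementation. The toy data, tree depth, learning rate, and number of iterations are all illustrative assumptions, and the sketch only produces predictions for the training points; a full implementation would also store the trees and their leaf values to score new data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy binary-classification data (hypothetical).
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

learning_rate, n_trees = 0.1, 20

# Step 1: the initial prediction is the log of the odds of the positive class.
F = np.full(len(y), np.log(y.mean() / (1 - y.mean())))

for _ in range(n_trees):
    # Convert the current log-odds predictions to probabilities.
    p = np.exp(F) / (1 + np.exp(F))
    residuals = y - p

    # Fit a small regression tree to the residuals.
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)

    # Transform each leaf's output back to log-odds space:
    # sum of residuals in the leaf / sum of p * (1 - p) in the leaf.
    leaf_ids = tree.apply(X)
    update = np.zeros_like(F)
    for leaf in np.unique(leaf_ids):
        in_leaf = leaf_ids == leaf
        update[in_leaf] = (residuals[in_leaf].sum()
                           / (p[in_leaf] * (1 - p[in_leaf])).sum())

    # Add the scaled leaf outputs to the running log-odds predictions.
    F += learning_rate * update

# Final predicted probabilities for the training points.
print(np.round(np.exp(F) / (1 + np.exp(F)), 3))
```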
Gradient boosting is a simple yet powerful tool. Although the underlying idea is straightforward, its effectiveness on tabular or structured data makes it extremely important in practice, and many data scientists have used gradient boosting to win Kaggle competitions.