What is gradient boosting?

Gradient boosting, often implemented as Gradient Boosted Decision Trees (GBDT), is a powerful machine learning algorithm. It builds on the decision tree algorithm; in fact, it is an ensemble of decision trees.

Ensemble methods

In machine learning, we often combine several simple models to make predictions that are more accurate than any single model's. There are three main types of ensemble methods, given below:

  • Bagging: In this method, we train many estimators (models) independently, typically on bootstrapped samples of the data, and combine their outputs. For a regression problem, we usually take the mean of the estimates; for a classification problem, we take a vote among the estimators.

  • Boosting: In this method, the estimators are built sequentially, and each new estimator tries to correct the errors of the ones before it.

  • Stacking: In this method, the predictions of multiple estimators of different types are combined by a final model that performs better than any of them alone. The final model is often called a meta-learner.

Note: The random forest is a bagging ensemble of decision trees.
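As a quick illustration of these three flavors, the sketch below uses scikit-learn's built-in ensemble estimators on a synthetic dataset; the dataset and hyperparameters are placeholders, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    BaggingRegressor,
    GradientBoostingRegressor,
    StackingRegressor,
)
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

# A small synthetic regression dataset to keep the example self-contained.
X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# Bagging: many trees trained independently on bootstrap samples; outputs are averaged.
bagging = BaggingRegressor(n_estimators=50, random_state=0)

# Boosting: trees are added one after another, each correcting the current ensemble.
boosting = GradientBoostingRegressor(n_estimators=50, random_state=0)

# Stacking: estimators of different types feed a final meta-model.
stacking = StackingRegressor(
    estimators=[("tree", DecisionTreeRegressor()), ("ridge", Ridge())],
    final_estimator=Ridge(),
)

for model in (bagging, boosting, stacking):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))
```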

Gradient boosting

Gradient boosting works well for both regression and classification tasks. A major difference between gradient boosting and the random forest is that the random forest builds its estimators independently and in parallel, while gradient boosting adds new estimators to the ensemble sequentially, each one correcting the errors of the current ensemble.

Working

Gradient boosting works in the following steps:

  • Step 1: We start with an initial prediction made by a very simple base estimator. For a regression problem, the initial prediction for every data point is the mean of all the target values. For a classification problem, we take the log of the odds, log(# of True samples / # of False samples), and later convert it into a probability for the class prediction.

  • Step 2: Calculate the loss on these predictions using a loss function, such as Mean Squared Error (MSE) or Mean Absolute Error (MAE).

  • Step 3: We build a new decision tree. This tree is trained on the original features to learn the residuals, i.e., the differences between the actual and predicted values (more generally, the negative gradient of the loss function, which is where the "gradient" in gradient boosting comes from).

  • Step 4: Add this new model to the ensemble. By that, we mean the next time we make a prediction, we'll use the first estimator in combination with the new decision tree. How we combine the two depends on whether we are performing regression or classification; we'll cover the details in the later subsections.

  • Step 5: Repeat from step 2 until the defined limit on the number of decision trees is reached, or until adding new trees no longer improves the predictions. A minimal code sketch of these steps is given after this list.
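To make these steps concrete, here is a minimal from-scratch sketch for a regression problem. It assumes a squared-error loss (so the residuals are simply actual minus predicted values) and uses scikit-learn's DecisionTreeRegressor as the weak estimator; the function names are illustrative, not part of any library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    """Fit a toy gradient-boosted regressor following the steps above."""
    # Step 1: the initial prediction for every sample is the mean of the targets.
    initial_prediction = np.mean(y)
    predictions = np.full(len(y), initial_prediction)
    trees = []
    for _ in range(n_trees):
        # Step 2: with squared-error loss, the residuals (negative gradients)
        # are simply the differences between actual and predicted values.
        residuals = y - predictions
        # Step 3: train a new shallow tree on the residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Step 4: add the tree to the ensemble; its output is scaled by the
        # learning rate so each tree only nudges the predictions.
        predictions += learning_rate * tree.predict(X)
        trees.append(tree)
        # Step 5: in practice, we would also stop early once the loss stops improving.
    return initial_prediction, trees

def gradient_boost_predict(X, initial_prediction, trees, learning_rate=0.1):
    """Combine the initial prediction with every tree's scaled output."""
    predictions = np.full(len(X), initial_prediction)
    for tree in trees:
        predictions += learning_rate * tree.predict(X)
    return predictions
```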

The diagram below illustrates the workflow:

The working of gradient boosting

The way the initial predictions are combined with the outputs of the new models is easier to see in the figure below:

The Loss vs No. of trees graph

Notice that we multiply the predictions from our new trees by a learning rate. This keeps each tree's contribution small, so the ensemble learns in a more controlled way.

Predictions for a regression problem

Let's recall step 1: our first prediction for every data point is the mean of all the target values. Each subsequent tree is trained to predict the residuals. So, for a regression problem, we add the scaled predictions of the new decision trees in the ensemble to the first prediction.
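For instance, with made-up numbers (the values below are purely illustrative), the final regression prediction is just the initial mean plus the learning-rate-scaled outputs of the trees:

```python
initial_prediction = 75.0          # mean of all target values (made-up number)
tree_outputs = [8.0, 3.5, 1.2]     # residual predictions from three successive trees
learning_rate = 0.1

# Final prediction = initial prediction + learning_rate * (sum of tree outputs)
final_prediction = initial_prediction + learning_rate * sum(tree_outputs)
print(final_prediction)            # 76.27
```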

Predictions for a classification problem

Making predictions for a classification problem is trickier because we took the log of the odds as our first prediction for all the values. To calculate residuals, we convert these initial predictions into probabilities. It is done by the following formula:
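\[ \text{probability} = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}} \]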

The leaves of the new trees predict probabilities. This is where a problem arises: our running predictions are in terms of the log of the odds, while the new trees are built on probabilities, so their leaf outputs are on the probability scale. To bring these leaf outputs back to the log-odds scale, we use a transformation. The most common transformation technique is given below:
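\[ \gamma_i = \frac{\sum_{j \,\in\, \text{leaf } i} \text{residual}_j}{\sum_{j \,\in\, \text{leaf } i} p_j \,(1 - p_j)} \]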

In this equation, the subscript i represents a leaf node of the tree: the numerator sums the residuals of the samples in that leaf, and each p_j is the probability previously predicted for sample j. The transformed leaf outputs, scaled by the learning rate, can now be added to the initial (log-odds) predictions.
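As a small numeric sketch (the sample counts and leaf output below are made up), the conversion between log-odds and probability, and the update of the prediction, look like this:

```python
import numpy as np

pos, neg = 6, 4                                           # made-up counts of True / False samples
log_odds = np.log(pos / neg)                              # initial prediction: log of the odds
probability = np.exp(log_odds) / (1 + np.exp(log_odds))   # converted to a probability (0.6 here)

learning_rate = 0.1
leaf_output = 1.4                                         # hypothetical transformed leaf value
new_log_odds = log_odds + learning_rate * leaf_output     # update on the log-odds scale
new_probability = np.exp(new_log_odds) / (1 + np.exp(new_log_odds))
```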

Conclusion

Gradient boosting is a simple yet powerful tool. Its strong performance on tabular or structured data makes it extremely important, and many data scientists have used gradient boosting to win Kaggle competitions.
