Gradient boosting, or Gradient Boosted Decision Trees (GBDT), is a powerful machine learning algorithm built on top of decision trees. In fact, it is an ensemble of decision trees.
In machine learning, we often combine several smaller models to make more accurate predictions. Such a combination is called an ensemble, and there are three main types of ensemble methods, given below:
Bagging: In this method, we train weak estimators (models) independently and combine their outputs to get a better estimate. In the case of a regression problem, we usually take the mean of the estimates; in the case of a classification problem, we take a vote among the estimators.
Boosting: In this method, the estimators are built sequentially. Each estimator depends on the previous ones and tries to correct their errors.
Stacking: In this method, multiple estimators of different types are trained, and a separate model (a meta-learner) learns how best to combine their predictions.
Note: The random forest is a bagging ensemble of decision trees.
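As a quick sketch of how these three methods look in practice, the snippet below runs scikit-learn's ready-made ensembles on a synthetic regression dataset. The dataset, the chosen base models, and all parameters here are illustrative assumptions, not part of any fixed recipe.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    RandomForestRegressor,      # bagging ensemble of decision trees
    GradientBoostingRegressor,  # boosting ensemble of decision trees
    StackingRegressor,          # stacking of different model types
)
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data used purely for illustration.
X, y = make_regression(n_samples=500, n_features=10, noise=10, random_state=0)

models = {
    "Bagging (random forest)": RandomForestRegressor(random_state=0),
    "Boosting (GBDT)": GradientBoostingRegressor(random_state=0),
    "Stacking": StackingRegressor(
        estimators=[("tree", DecisionTreeRegressor(max_depth=3)),
                    ("ridge", Ridge())],
        final_estimator=Ridge(),
    ),
}

for name, model in models.items():
    # 5-fold cross-validated R^2 for each ensemble.
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: R^2 = {score:.3f}")
```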
Gradient boosting works well for both regression and classification tasks. A major difference between gradient boosting and the random forest is that the random forest builds its estimators independently of one another (in parallel), while gradient boosting adds new estimators to the ensemble sequentially.
Gradient boosting works in the following steps:
Step 1: We start with initial predictions made by a very simple first estimator. When solving a regression problem, we take the first prediction for every data point to be the average of all the target values. For a classification problem, we take the first prediction to be the log of the odds of the target class.
Step 2: Calculate the loss on these predictions using a loss function, such as Mean Squared Error (MSE) or Mean Absolute Error (MAE). To learn more about different loss functions, refer to this link.
Step 3: We build a new decision tree using these predictions. This tree is trained on the original dataset to learn the residuals, that is, the differences between the actual values and the current predictions.
Step 4: Add this new tree to the ensemble. By that, we mean that the next time we make a prediction, we'll use the first estimator in combination with the new decision tree. How we combine the two depends on whether we are performing regression or classification; we'll learn more about the details in the next subsection.
Step 5: Repeat from step 2 until the defined limit on the number of decision trees has been reached, or until adding new trees no longer improves the predictions.
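The steps above can be written down in a few lines of code. The following is a minimal from-scratch sketch of gradient boosting for regression, assuming squared-error loss, a synthetic dataset, and scikit-learn's DecisionTreeRegressor for the individual trees; the tree depth, learning rate, and number of trees are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy regression data (hypothetical).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=200)

n_trees, learning_rate = 100, 0.1

# Step 1: the initial prediction for every point is the mean of the targets.
predictions = np.full(len(y), y.mean())
trees = []

for _ in range(n_trees):
    # Step 2: for squared-error loss, the negative gradient is the residual.
    residuals = y - predictions

    # Step 3: fit a small tree to the residuals.
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)

    # Step 4: add the new tree's scaled output to the running predictions.
    predictions += learning_rate * tree.predict(X)
    trees.append(tree)

# Step 5 corresponds to the loop limit above (n_trees).
# Predicting for a new point repeats the same combination:
X_new = np.array([[2.5]])
y_new = y.mean() + learning_rate * sum(t.predict(X_new) for t in trees)
print(y_new)
```

Note that, for squared-error loss, the negative gradient of the loss is exactly the residual, which is why fitting trees to residuals amounts to gradient descent in function space.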
The diagram below illustrates the workflow:
The figure below shows more concretely how the initial predictions are combined with the new models:
Notice that we multiply the predictions from the new trees by a learning rate. This scales down each tree's contribution so that the ensemble learns in a more controlled, gradual way.
Let's recall step 1: our first prediction for every value is the average of the target values, and each subsequent tree is trained to predict the residuals. In the case of a regression problem, we simply add the (scaled) predictions of the new decision trees to the initial prediction.
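If you use a library implementation instead of writing the loop yourself, this knob is typically exposed directly; for example, scikit-learn's GradientBoostingRegressor has a learning_rate parameter. The sketch below (on an arbitrary synthetic dataset) simply fits the model with a few different learning rates to show the effect:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic data used purely for illustration.
X, y = make_regression(n_samples=1000, noise=15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A smaller learning rate shrinks each tree's contribution and usually
# needs more trees to reach the same fit.
for lr in (1.0, 0.1, 0.01):
    gbdt = GradientBoostingRegressor(learning_rate=lr, n_estimators=200,
                                     random_state=0)
    gbdt.fit(X_train, y_train)
    print(f"learning_rate={lr}: test R^2 = {gbdt.score(X_test, y_test):.3f}")
```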
Making predictions for a classification problem is trickier because we took the log of the odds as our first prediction for every data point. To calculate residuals, we first convert these log-odds predictions to probabilities using the following formula:
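$$
p = \frac{e^{\log(\text{odds})}}{1 + e^{\log(\text{odds})}}
$$

Here, $p$ is the probability corresponding to the current log-odds prediction.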
The new trees, however, are built from these probabilities: their leaves hold values derived from the probability residuals. This is where a problem arises. Our running predictions are in terms of log-odds, while the leaf values come from probabilities. To bring the leaf values back into log-odds space, we apply a transformation. The most common transformation is given below:
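$$
\text{leaf output} = \frac{\sum_{i \in \text{leaf}} r_i}{\sum_{i \in \text{leaf}} p_i\,(1 - p_i)}
$$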
In this equation, the subscript $i$ runs over the samples that fall into a given leaf: $r_i$ is the residual of the $i$-th sample, and $p_i$ is the probability predicted for it by the previous iteration of the ensemble.
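To make these classification steps concrete, here is a minimal from-scratch sketch of one possible implementation. The toy data, tree depth, learning rate, and number of iterations are all illustrative assumptions, and the sketch only produces predictions for the training points; a full implementation would also store the trees and their leaf values to score new data.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy binary-classification data (hypothetical).
X = np.arange(8, dtype=float).reshape(-1, 1)
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])

learning_rate, n_trees = 0.1, 20

# Step 1: the initial prediction is the log of the odds of the positive class.
F = np.full(len(y), np.log(y.mean() / (1 - y.mean())))

for _ in range(n_trees):
    # Convert the current log-odds predictions to probabilities.
    p = np.exp(F) / (1 + np.exp(F))
    residuals = y - p

    # Fit a small regression tree to the residuals.
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residuals)

    # Transform each leaf's output back to log-odds space:
    # sum of residuals in the leaf / sum of p * (1 - p) in the leaf.
    leaf_ids = tree.apply(X)
    update = np.zeros_like(F)
    for leaf in np.unique(leaf_ids):
        in_leaf = leaf_ids == leaf
        update[in_leaf] = (residuals[in_leaf].sum()
                           / (p[in_leaf] * (1 - p[in_leaf])).sum())

    # Add the scaled leaf outputs to the running log-odds predictions.
    F += learning_rate * update

# Final predicted probabilities for the training points.
print(np.round(np.exp(F) / (1 + np.exp(F)), 3))
```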
Gradient boosting is a simple yet powerful tool. Although the underlying idea is straightforward, its effectiveness on tabular or structured data makes it extremely important in practice, and many data scientists have used gradient boosting to win Kaggle competitions.