Gradient Boosting with XGBoost

Learn about the XGBoost gradient boosting algorithm.

A gradient boosting framework

XGBoost stands for Extreme Gradient Boosting. It is a scalable framework for training gradient boosted ensembles that use decision trees as the weak learners. The algorithm is accessible in R via the xgboost package and has become a go-to choice for production scenarios, given its predictive performance and scalability.
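As a minimal sketch, a boosted tree ensemble can be trained with a single call to xgboost(), assuming the package's classic interface and a numeric feature matrix; the data below is a synthetic placeholder rather than a real dataset.

```r
library(xgboost)

# Synthetic placeholder data: a numeric feature matrix and binary labels
set.seed(42)
X <- matrix(rnorm(500 * 4), nrow = 500)
y <- as.integer(X[, 1] + rnorm(500) > 0)

# Train a small gradient boosted tree ensemble
model <- xgboost(
  data = X,
  label = y,
  nrounds = 25,                  # number of boosting rounds (trees)
  objective = "binary:logistic", # loss function for binary classification
  verbose = 0
)

# Predicted probabilities for the training rows
head(predict(model, X))
```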

XGBoost is a gradient boosting framework because the algorithm’s underlying mathematics supports training ensembles with many different loss functions. In machine learning, a loss function is a mathematical function that measures the quality of a model’s predictions.

XGBoost supports loss functions for scenarios like:

  • Regression

  • Binary classification

  • Multiclass classification

  • Cox survival models

  • Ranking

Loss functions are also often referred to as objective functions because the objective of machine learning algorithms is to minimize the value of the loss function.
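In the xgboost package, the loss function is chosen through the objective parameter. The strings below are the standard XGBoost objective names for the scenarios listed above; the commented training call is a hedged sketch that assumes a feature matrix X and label vector y already exist.

```r
library(xgboost)

# Standard XGBoost objective strings for the scenarios listed above
objectives <- c(
  regression   = "reg:squarederror", # squared-error regression
  binary       = "binary:logistic",  # binary classification (probability output)
  multiclass   = "multi:softprob",   # multiclass classification (also needs num_class)
  cox_survival = "survival:cox",     # Cox proportional hazards survival model
  ranking      = "rank:pairwise"     # pairwise learning to rank
)

# Switching the loss function is a one-line change, e.g. for regression:
# model <- xgboost(data = X, label = y, nrounds = 50,
#                  objective = objectives[["regression"]], verbose = 0)
```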

The following sections cover essential aspects of the XGBoost algorithm’s mathematics. More information is available on the XGBoost official website.

Loss functions

XGBoost supports many loss functions and represents these functions generically with the following mathematical notation:

$$l(y_i, \hat{y}_i)$$

where $y_i$ denotes the actual value of observation $i$ (i.e., the label value in classification) and $\hat{y}_i$ denotes the XGBoost ensemble’s prediction for that observation.

The XGBoost algorithm relies on gradients to minimize the loss function. Gradients are simply derivatives of the loss function with respect to the model’s predictions. Relying on gradients constrains the XGBoost framework to differentiable loss functions.
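To make the role of gradients concrete, here is a hedged sketch of a custom squared-error objective passed to xgb.train(): all the algorithm requires from a loss function is its first derivative (gradient) and second derivative (hessian) with respect to the predictions. The data is a synthetic placeholder, and the call assumes the package’s classic xgb.train() interface.

```r
library(xgboost)

# Synthetic placeholder regression data
set.seed(1)
X <- matrix(rnorm(300 * 3), ncol = 3)
y <- 2 * X[, 1] + rnorm(300)
dtrain <- xgb.DMatrix(data = X, label = y)

# Custom squared-error objective: XGBoost only asks for the first and
# second derivatives of the loss with respect to the predictions
squared_error_obj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  grad <- preds - labels         # derivative of 0.5 * (preds - labels)^2
  hess <- rep(1, length(preds))  # second derivative is constant
  list(grad = grad, hess = hess)
}

model <- xgb.train(
  params = list(max_depth = 3, eta = 0.1),
  data = dtrain,
  nrounds = 50,
  obj = squared_error_obj
)
```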

The objective function

Gradient boosting algorithms are prone to overfitting, and the XGBoost algorithm is no exception. To combat overfitting, the XGBoost algorithm starts with the following objective function:

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i)$$

The objective function is the sum of the loss function over all the observations in the training dataset (denoted by $n$). XGBoost then adds a regularization term to the objective function:

$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \omega(f_k)$$

A regularization term adds a penalty to an objective function. This penalty encourages machine learning algorithms to produce less complex models (i.e., models less likely to overfit).

In the equation above, $K$ denotes the number of trees in the ensemble and $\omega(f_k)$ denotes the regularization penalty applied to the $k$-th tree $f_k$.
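In the xgboost package, this penalty is shaped by hyperparameters such as max_depth, gamma, lambda, and alpha. The sketch below shows where they are set; the specific values are illustrative rather than recommendations, and the data is again a synthetic placeholder.

```r
library(xgboost)

# Synthetic placeholder data
set.seed(7)
X <- matrix(rnorm(400 * 5), ncol = 5)
y <- as.integer(X[, 2] - X[, 4] + rnorm(400) > 0)

# Hyperparameters that control the complexity penalty on each tree
params <- list(
  objective = "binary:logistic",
  max_depth = 4,   # cap tree depth (fewer leaves means simpler trees)
  gamma     = 1,   # minimum loss reduction required to add a split
  lambda    = 1,   # L2 penalty on leaf weights
  alpha     = 0,   # L1 penalty on leaf weights
  eta       = 0.1  # learning rate: shrinks each tree's contribution
)

model <- xgb.train(
  params = params,
  data = xgb.DMatrix(data = X, label = y),
  nrounds = 100
)
```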