Gradient Boosting with XGBoost
Learn about the XGBoost gradient boosting algorithm.
A gradient boosting framework
XGBoost stands for Extreme Gradient Boosting. The XGBoost algorithm is a scalable framework for training gradient boosted ensembles that use decision trees as the weak learners. XGBoost is accessible in R via the xgboost package and has become a go-to algorithm for production scenarios because of its predictive performance and scalability.
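The sketch below is a minimal example of fitting an XGBoost model in R. It assumes the xgboost package is installed and uses the agaricus mushroom data that ships with the package; the parameter values are illustrative rather than recommendations.

```r
# Minimal sketch: train a small gradient boosted tree ensemble in R.
# Assumes the xgboost package is installed; the agaricus data ships with it.
library(xgboost)

data(agaricus.train, package = "xgboost")
data(agaricus.test, package = "xgboost")

# xgb.DMatrix is XGBoost's optimized internal data structure.
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)
dtest  <- xgb.DMatrix(data = agaricus.test$data,  label = agaricus.test$label)

# Fit an ensemble of 10 boosted trees for binary classification.
model <- xgb.train(
  params  = list(objective = "binary:logistic", max_depth = 3, eta = 0.3),
  data    = dtrain,
  nrounds = 10
)

# Predicted probabilities for the held-out observations.
head(predict(model, dtest))
```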
XGBoost is a gradient boosting framework because its underlying mathematics supports training ensembles with many different loss functions. In machine learning, a loss function is a mathematical function that measures the quality of a model's predictions.
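As a concrete example (a standard choice, not the only loss XGBoost supports), the squared-error loss used for regression scores a prediction by the squared distance between the predicted and observed values:

$$l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2$$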
XGBoost supports loss functions for scenarios like the ones below (a short code sketch after this list shows how each is selected in the R package):
Regression
Binary classification
Multiclass classification
Cox survival models
Ranking
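In the xgboost R package, each of these scenarios corresponds to an objective string passed through the model's parameters. The mapping below is a sketch; values such as num_class are illustrative.

```r
# Objective strings that select a loss function in the xgboost R package.
objectives <- c(
  regression                = "reg:squarederror",  # squared-error regression
  binary_classification     = "binary:logistic",   # logistic loss, probability output
  multiclass_classification = "multi:softprob",    # softmax over num_class classes
  cox_survival              = "survival:cox",      # Cox proportional hazards
  ranking                   = "rank:pairwise"      # pairwise ranking loss
)

# The chosen string is supplied via the objective parameter, e.g. multiclass:
params <- list(objective = "multi:softprob", num_class = 3, max_depth = 4)
```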
Loss functions are also often referred to as objective functions because the objective of machine learning algorithms is to minimize the value of the loss function.
The following sections cover essential aspects of the XGBoost algorithm’s mathematics. More information is available on the XGBoost official website.
Loss functions
XGBoost supports many loss functions and represents these functions generically with the following mathematical notation:
$$l(y_i, \hat{y}_i)$$

where $y_i$ is the observed value of the target for the $i$-th training observation and $\hat{y}_i$ is the model's corresponding prediction.
The XGBoost algorithm relies on gradients to minimize these loss functions. Gradients are simply derivatives of the loss function with respect to the model's predictions. Using gradients constrains the XGBoost framework to differentiable loss functions.
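To make the role of derivatives concrete, the sketch below fits a model with a hand-written logistic loss. This is an illustrative custom objective, not the lesson's required approach; it assumes a recent version of the xgboost package, which asks for both the first derivative (gradient) and second derivative (hessian) of the loss.

```r
# Sketch: supply a custom differentiable loss to XGBoost in R.
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)

# Hand-written logistic loss: XGBoost only needs its derivatives.
logistic_obj <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  probs  <- 1 / (1 + exp(-preds))   # sigmoid of the raw prediction
  grad   <- probs - labels          # first derivative (gradient)
  hess   <- probs * (1 - probs)     # second derivative (hessian)
  list(grad = grad, hess = hess)
}

model <- xgb.train(
  params  = list(max_depth = 3, eta = 0.3),
  data    = dtrain,
  nrounds = 10,
  obj     = logistic_obj
)
```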
The objective function
Gradient boosting algorithms are prone to overfitting, and the XGBoost algorithm is no exception. To combat overfitting, the XGBoost algorithm starts with the following objective function:
$$\text{obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k)$$

The objective function is the sum of the loss function over all the training dataset observations (denoted by $n$) plus a regularization term summed over the $K$ trees in the ensemble.
A regularization term adds a penalty to an objective function. This penalty encourages machine learning algorithms to produce less complex models (i.e., models less likely to overfit).
In the equation above, $\Omega(f_k)$ is the regularization penalty applied to the $k$-th tree in the ensemble. In XGBoost, this penalty grows with the number of leaves in a tree and with the magnitude of its leaf weights, so trees with simpler structures are preferred.
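In the R package, the strength of this regularization penalty is exposed through tuning parameters. The sketch below names a few of them; the specific values are illustrative, not recommendations.

```r
# Sketch: regularization-related parameters in the xgboost R package.
library(xgboost)

data(agaricus.train, package = "xgboost")
dtrain <- xgb.DMatrix(data = agaricus.train$data, label = agaricus.train$label)

params <- list(
  objective = "binary:logistic",
  gamma     = 1,    # minimum loss reduction required to make a split
  lambda    = 2,    # L2 penalty on leaf weights
  alpha     = 0.5,  # L1 penalty on leaf weights
  max_depth = 4     # caps tree complexity directly
)

regularized_model <- xgb.train(params = params, data = dtrain, nrounds = 25)
```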