What is XGBoost?

XGBoost, which stands for Extreme Gradient Boosting, is a machine learning library that implements gradient boosting: it works iteratively, producing a boosted model by adding newly trained models to an ensemble. "Gradient" here refers to the use of gradient descent on the loss function, which determines the parameters of each new model added to the ensemble. At the end of the process, a better-performing model is produced.

How gradient boosting works

[Diagram: how gradient boosting works]

Explanation

From the diagram above,

  1. First, we start off with a single naive model (usually a simple model with a modest accuracy or metric score) in the ensemble and make predictions.

  2. The result of these predictions is then used to compute the loss obtained by the model. Here, metrics like the mean squared error (MSE) and R-squared can be used, depending on the problem at hand.

  3. The loss computed in step 2 (more precisely, its gradient with respect to the current predictions) is then used to train a new model.

  4. The newly trained model is then added to the ensemble.

  5. This process continues until a model with the lowest achievable loss is obtained. At the end of the day, we can say that the model has been boosted! A minimal sketch of this loop is shown after this list.
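To make these steps concrete, here is a minimal, illustrative sketch of the gradient boosting loop for a regression problem, built with plain scikit-learn decision trees. The function name and the n_stages, learning_rate, and max_depth values are arbitrary choices for illustration; XGBoost's actual implementation is far more sophisticated, but the underlying idea is the same.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boost(X, y, n_stages=50, learning_rate=0.1):
    # Step 1: start from a single naive model, here simply the mean of the target
    prediction = np.full(len(y), float(np.mean(y)))
    trees = []
    for _ in range(n_stages):
        # Step 2: measure how wrong the current ensemble is
        # (for squared error, the negative gradient is just the residual)
        residuals = y - prediction
        # Step 3: train a new model on that error signal
        tree = DecisionTreeRegressor(max_depth=3)
        tree.fit(X, residuals)
        # Step 4: add the new model's (scaled) predictions to the ensemble
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    # Step 5: after enough rounds, the combined model has been "boosted"
    return trees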

Why use XGBoost?

XGBoost frequently outperforms conventional machine learning algorithms, as it has repeatedly been shown to produce models with better accuracy or metric scores, particularly on tabular data.

Because it provides interfaces for the Python and R programming languages, it has become popular among data science professionals. Additionally, it runs on Windows, Linux, macOS, and other systems.

Our dataset

To illustrate how XGBoost is used for making predictions, we will use the Boston housing dataset that ships with scikit-learn. (Note that load_boston was deprecated and later removed from scikit-learn, so running this example requires an older release of the library.)

# importing necessary libraries and modules
import pandas as pd
from sklearn.datasets import load_boston # removed in scikit-learn 1.2; requires an older version
boston = load_boston() # loading the dataset
data = pd.DataFrame(boston.data) # converting to a pandas DataFrame
data.columns = boston.feature_names # assigning column names
print(f"Columns of our dataset: {data.columns}")
print("\n",f"Shape of data: {data.shape}")
print(data.head(4))

Exploratory data analysis (EDA)

We will not be doing much EDA, but we still need to get a basic understanding of our data using the code below:

# getting statistical description of our data
print(data.describe())
# getting information about our data
data.info()

Selecting features and the target variable

Here, the LSTAT column will serve as our target variable (i.e., what our model will be predicting), while the rest of the columns will serve as our features.

# selecting the features
X = data.drop("LSTAT", axis=1)
# selecting the target variable
y = data.LSTAT.to_frame()
print(X.head())
print(y.head())

Splitting our data

In the code below, we will use the train_test_split() method to split our data into training and validation sets. We will use 80% of our data for training, while the rest will be used for validation.

from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8)
print("X_train: ", X_train.shape)
print("X_valid: ", X_valid.shape)
print("y_train: ", y_train.shape)
print("y_valid: ", y_valid.shape)

Data standardization

Standardizing our data rescales each feature to zero mean and unit variance, putting all features on a comparable scale for the model to learn from. We will make use of StandardScaler() to standardize our training and validation features.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid) # only transform the validation data; the scaler was already fit on the training data
# taking a look at the scaled training features
print(X_train)

Using XGBoost

Now we will make use of the XGBRegressor model from xgboost, which is the scikit-learn API wrapper for xgboost. We use it because we are dealing with a regression problem. If this were a classification problem, we would use the XGBClassifier model instead.

from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
prediction = model.predict(X_valid)
print(prediction)

Explanation

  • Line 1: We import the XGBRegressor model from xgboost.

  • Line 2: We declare an instance of the XGBRegressor model and the value is assigned to a variable, model.

  • Line 3: The model is then trained on the training sets of our features and target variable.

  • Line 4: We make predictions on the validation data using the .predict() function. The result is assigned to a variable, prediction.

  • Line 5: We print the predictions.

Evaluating the model

We will be using the root mean squared error (RMSE) metric to evaluate our regression model.

import numpy as np
from sklearn.metrics import mean_squared_error
score = np.sqrt(mean_squared_error(y_valid, prediction)) # RMSE is the square root of the MSE
print(f"RMSE: {score}")

Hyperparameter tuning of the XGBRegressor

Note that, so far, we have only used the default parameter values of the XGBRegressor model. Now, let's take a closer look at the model, explore its parameters, and ultimately choose the best parameter values (hyperparameters) for a better-performing model; this process is called hyperparameter tuning. We will be making use of GridSearchCV from the sklearn.model_selection module for the tuning process.

Hyperparameters of XGBoost

Below are the most commonly tuned hyperparameters of the XGBRegressor algorithm (a short sketch of setting them directly on the model follows this list):

  • learning_rate (float): Typical values range between 0.01–0.2. This controls how much each additional base learner contributes to the ensemble; smaller values make the model fit errors more slowly.

  • max_depth (int): Typical values range between 1–10. This specifies how deep the nodes of each decision tree can go. It cannot be negative.

  • gamma (float): Typical values range between 0–0.5. This is the minimum loss reduction required to make a further partition on a leaf node of the tree.

  • subsample (float): Typical values range between 0.5–0.9. This is the fraction of the training data used to grow each tree.
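As a quick reference, the sketch below shows how these hyperparameters might be set directly on the model. The specific values are arbitrary examples within the typical ranges listed above, not recommended settings.

from xgboost import XGBRegressor

# arbitrary example values within the typical ranges above
tuned_by_hand = XGBRegressor(
    learning_rate=0.05, # contribution of each new tree
    max_depth=4,        # maximum depth of each tree
    gamma=0.25,         # minimum loss reduction to split a leaf node
    subsample=0.8       # fraction of training rows sampled per tree
)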

Step 1: Create a parameter grid for the tuning

A parameter grid is a dictionary that contains different values of the hyperparameter needed for the tuning process so that after tuning, the best values are chosen automatically.

We will only tune a few parameters, since trying to tune all of them can be very time-consuming.

# creating a parameter grid
param_grid = {
    "learning_rate": [0.1, 0.01, 0.05],
    "max_depth": [3, 4, 5, 7],
    "gamma": [0, 0.25, 1]
}

Step 2: Import the GridSearchCV

In the code below, we will import GridSearchCV, which will be used for the tuning process.

from sklearn.model_selection import GridSearchCV
# initializing the XGBRegressor
xgb_model = XGBRegressor()
# performing the tuning operation
grid_cv = GridSearchCV(xgb_model, param_grid, n_jobs=-1, cv=3, scoring="neg_root_mean_squared_error")
# Training our model
new_model = grid_cv.fit(X_train, y_train)

Explanation

  • Line 1: We import GridSearchCV.

  • Line 3: We declare an instance of the XGBRegressor model.

  • Line 5: We configure the tuning process using GridSearchCV.

  • Line 7: We fit the training datasets to the new model, new_model.

Step 3: Making predictions and evaluation

In the code below, we will make a prediction using the new model we just obtained from the tuning process. We will also evaluate this model using the RMSE metric, as we did for our first model in the previous section.

# making prediction
prediction2 = new_model.predict(X_valid)
# model evaluation
rmse = np.sqrt(mean_squared_error(y_valid, prediction2))
print("\n",f"RMSE: {rmse}")

Bravo! We should now see that the root mean squared error of the new model, new_model, is better than that of the initial model, model.

What were the best parameter values?

After the tuning process, GridSearchCV automatically builds and trains a model on the training data using the best parameter values it found. Now, how do we obtain the best hyperparameter values that were used by this model?

To obtain the best parameter values used by the new model, we simply access the .best_params_ attribute. We will illustrate this in the code below:

print(new_model.best_params_)

The output above shows exactly the parameter values that our new model used to achieve its better performance.

