XGBoost, which stands for Extreme Gradient Boosting, is a machine learning library that employs gradient boosting: it iteratively produces a boosted model by adding newly trained models to an ensemble. "Gradient" here refers to the use of gradient descent on the loss function, which ultimately determines the parameters of the new model to be added to the ensemble. At the end of the process, a better-performing model is produced.
From the diagram above:

1. First, we start off with a single naive model in the ensemble (usually a simple model with a modest accuracy or metric score) and make predictions.
2. The result of these predictions is then used to measure the loss obtained by the model. Here, metrics like mean squared error (MSE) and R-squared can be used, depending on the problem at hand.
3. The loss obtained in step 2 is then used to train a new model.
4. The newly trained model is then added to the ensemble.
5. This process continues until a model with a sufficiently low loss is obtained. At the end of the day, we can say that the model has been boosted! A minimal code sketch of this loop follows.
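To make the loop concrete, here is a minimal sketch of the gradient boosting idea for a squared-error loss, using scikit-learn decision trees as base learners. The toy data and the `n_rounds` and `learning_rate` values are illustrative assumptions, not XGBoost's internals:

```python
# a minimal sketch of gradient boosting for squared error;
# each new tree is fit to the residuals (negative gradients) of the ensemble
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 1))            # toy data (illustrative)
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)

n_rounds, learning_rate = 50, 0.1                # illustrative hyperparameters
prediction = np.full(y.shape, y.mean())          # step 1: naive initial model
trees = []
for _ in range(n_rounds):
    residuals = y - prediction                   # step 2: negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=3)    # step 3: train a new model on the residuals
    tree.fit(X, residuals)
    trees.append(tree)                           # step 4: add it to the ensemble
    prediction += learning_rate * tree.predict(X)  # step 5: boosted prediction
```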
XGBoost often outperforms conventional machine learning algorithms, as it has repeatedly proven to produce models with better accuracy or metric scores.
Because it provides interfaces for the Python and R programming languages, it has become popular among data science professionals. Additionally, it runs on Windows, Linux, macOS, and other operating systems.
To illustrate how XGBoost is used for making predictions, we will use the Boston housing dataset. Note that `load_boston` was deprecated in scikit-learn 1.0 and removed in version 1.2, so the loading code below requires an older scikit-learn version.
```python
# importing necessary libraries and modules
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()  # loading the dataset
data = pd.DataFrame(boston.data)  # converting to a pandas DataFrame
data.columns = boston.feature_names  # obtaining column names

print(f"Columns of our dataset: {data.columns}")
print("\n", f"Shape of data: {data.shape}")
print(data.head(4))
```
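If you are on a newer scikit-learn version where `load_boston` has been removed, a commonly circulated workaround is to fetch the raw data directly. This is a sketch that assumes network access to the original data URL:

```python
# a sketch for scikit-learn >= 1.2, where load_boston has been removed;
# fetches the raw Boston housing data directly (network access assumed)
import numpy as np
import pandas as pd

url = "http://lib.stat.cmu.edu/datasets/boston"
raw = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)

# each observation spans two physical rows in the raw file
features = np.hstack([raw.values[::2, :], raw.values[1::2, :2]])
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT"]
data = pd.DataFrame(features, columns=columns)
```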
We will not be doing much exploratory data analysis (EDA), but we still need to understand our data by using the code below:
```python
# getting a statistical description of our data
print(data.describe())

# getting information about our data
data.info()
```
Here, the `LSTAT` column will serve as our target variable (i.e., what our model will be predicting), while the rest of our columns will serve as our features.
```python
# selecting the features
X = data.drop("LSTAT", axis=1)

# selecting the target variable
y = data.LSTAT.to_frame()

print(X.head())
print(y.head())
```
In the code below, we will use the `train_test_split()` function to split our data into training and validation sets. We will use 80% of our data for training, while the rest will be used for validation.
```python
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8)

print("X_train: ", X_train.shape)
print("X_valid: ", X_valid.shape)
print("y_train: ", y_train.shape)
print("y_valid: ", y_valid.shape)
```
Standardizing our data improves its quality for the model to learn from. We will make use of the `StandardScaler()` class to standardize our training and validation features.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_valid = scaler.transform(X_valid)  # reusing the scaler fitted on the training data

# taking a look at our scaled training data
print(X_train)
```
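Note that the scaler is fitted on the training features only and then reused to transform the validation features; fitting a separate scaler on the validation set would let information from the validation data leak into the preprocessing step.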
Now we will make use of the `XGBRegressor` model from `xgboost`, which is a wrapper interface for `xgboost` following the scikit-learn API. This is because we are dealing with a regression problem. If this were a classification problem, we would use the `XGBClassifier` model.
```python
from xgboost import XGBRegressor
model = XGBRegressor()
model.fit(X_train, y_train)
prediction = model.predict(X_valid)
print(prediction)
```
- Line 1: We import the `XGBRegressor` model from `xgboost`.
- Line 2: We declare an instance of the `XGBRegressor` model and assign it to a variable, `model`.
- Line 3: The model is then trained on the training sets of our features and target variable.
- Line 4: We make predictions on the validation data using the `.predict()` method. The result is assigned to a variable, `prediction`.
- Line 5: We print the output of the prediction.
We will be using the root mean squared error (RMSE) metric to evaluate our regression model.
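As a reminder, RMSE is the square root of the average squared difference between the predictions and the true values, i.e., RMSE = √((1/n) Σᵢ (ŷᵢ − yᵢ)²); lower values indicate a better fit.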
```python
import numpy as np
from sklearn.metrics import mean_squared_error

score = np.sqrt(mean_squared_error(y_valid, prediction))
print(f"RMSE: {score}")
```
It is worth noting that, so far, we have only used the default parameter values of the `XGBRegressor` model. Now, let's take a closer look at the model, explore its parameters, and ultimately choose the best parameter values (hyperparameters) for a better-performing model; this process is called hyperparameter tuning. We will make use of `GridSearchCV` from the `sklearn.model_selection` module for the tuning process.
Below are the most commonly tuned hyperparameters of the `XGBRegressor` algorithm (a short sketch of setting them follows the list):

- `learning_rate` (`float`): Typical values range between 0.01–0.2. This specifies how quickly the model fits the residual errors by using additional base learners.
- `max_depth` (`int`): Typical values range between 1–10. This specifies how deep the nodes of each decision tree can go. It cannot take a negative number.
- `gamma` (`float`): Typical values range between 0–0.5. This is the minimum loss reduction required to make a further partition on a leaf node of the tree.
- `subsample` (`float`): Typical values range between 0.5–0.9. This represents the fraction of the training data used to train each tree.
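As an illustration, these hyperparameters map directly onto the `XGBRegressor` constructor. The values below are arbitrary examples within the typical ranges, not recommended settings:

```python
from xgboost import XGBRegressor

# arbitrary example values within the typical ranges shown above
example_model = XGBRegressor(
    learning_rate=0.1,   # how quickly the model fits the residual errors
    max_depth=5,         # maximum depth of each decision tree
    gamma=0.25,          # minimum loss reduction to split a leaf node
    subsample=0.8        # fraction of the training data used per tree
)
```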
A parameter grid is a dictionary that contains candidate values for each hyperparameter needed in the tuning process, so that after tuning, the best values are chosen automatically.
We will use only a few parameters for tuning, as trying to use all of them can be very time-consuming.
```python
# creating a parameter grid
param_grid = {
    "learning_rate": [0.1, 0.01, 0.05],
    "max_depth": [3, 4, 5, 7],
    "gamma": [0, 0.25, 1]
}
```
In the code below, we will import `GridSearchCV`, which will be used for the tuning process.
```python
from sklearn.model_selection import GridSearchCV
# initializing the XGBRegressor
xgb_model = XGBRegressor()
# performing the tuning operation
grid_cv = GridSearchCV(xgb_model, param_grid, n_jobs=-1,
                       cv=3, scoring="neg_root_mean_squared_error")
# Training our model
new_model = grid_cv.fit(X_train, y_train)
```
- Line 1: We import `GridSearchCV`.
- Line 3: We declare an instance of the `XGBRegressor` model.
- Line 5: We set up the tuning process using `GridSearchCV`.
- Line 8: We fit the training datasets to the new model, `new_model`.
In the code below, we will make a prediction using the new model we just obtained from the tuning process. We will also evaluate it using the RMSE metric, as we did for our first model in the previous section.
```python
# making a prediction
prediction2 = new_model.predict(X_valid)

# model evaluation
rmse = np.sqrt(mean_squared_error(y_valid, prediction2))
print("\n", f"RMSE: {rmse}")
```
Bravo! We can now see that the root mean squared error of the new model, `new_model`, is lower than that of the initial model, `model`.
After the tuning process, a model is created using the best parameter values and automatically trained with these hyperparameters on the dataset. Now, how do we obtain the best hyperparameter values that this model used?
To obtain the best parameter values used by the new model, we simply use the `.best_params_` attribute. We will illustrate this in the code below:
```python
print(new_model.best_params_)
```
The output above shows exactly the parameter values our new model used to achieve its better performance.
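Beyond `best_params_`, `GridSearchCV` also exposes the refit best model and its cross-validated score, which can be handy here:

```python
# the best model refit on the training data, and its mean cross-validated score
print(new_model.best_estimator_)
print(new_model.best_score_)  # negative RMSE, since we used "neg_root_mean_squared_error"
```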