Ensemble methods in Python: Bagging

Ensemble methods in machine learning combine multiple models to improve overall predictive performance. This approach is particularly effective when individual models have limitations or biases that the other models can offset. One prominent ensemble technique is bagging (also known as bootstrap aggregating).

Bagging aims to reduce overfitting and variance by training multiple instances of a base model on different subsets of the training data. Variance in machine learning refers to the sensitivity of a model to fluctuations in the training data, indicating how much the model's predictions vary with different training sets. Each subset is obtained through bootstrap sampling, a statistical technique in which subsets of a dataset are repeatedly drawn with replacement, so the same data point can appear more than once in a subset. The final prediction is typically an average of, or a vote among, the individual models' predictions.
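To make bootstrap sampling concrete, here is a minimal sketch using NumPy on a toy array of ten point indices. The array, the seed, and the number of samples are assumptions chosen only for illustration; they are not part of the article's example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, assumed for this sketch
data = np.arange(10)                 # toy "dataset": indices of 10 data points

# Draw three bootstrap samples, each the same size as the original data.
# replace=True means a point can be picked more than once.
bootstrap_samples = [
    rng.choice(data, size=data.size, replace=True) for _ in range(3)
]

for i, sample in enumerate(bootstrap_samples, start=1):
    n_unique = np.unique(sample).size
    print(f"Sample {i}: {sample} ({n_unique} unique points)")
```

Because points are drawn with replacement, each sample usually repeats some points and omits others, which is what makes the models trained on them differ from one another.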

Bagging algorithm

How to implement bagging using Python

Follow the steps below to implement the bagging algorithm in Python:

1. Import the libraries

The first step is to import the required libraries, as shown in the code below:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

2. Load the dataset

The next step is to load the dataset. We’ll use the breast cancer dataset provided by the sklearn library. This dataset consists of 569 samples with 30 features each. The target variable is the diagnosis, where 0 represents malignant and 1 represents benign tumors. The train_test_split function divides the dataset into training and testing data; with test_size=0.2, 20% of the samples are held out for testing.

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=10)
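Before training, it can help to confirm the shape of the data and how the labels are encoded. The following sketch repeats the loading step so that it runs on its own; the print statements are additions for inspection only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=10
)

# (number of samples, number of features)
print(cancer.data.shape)
# Class names; the position of each name is the integer label it maps to
print(list(cancer.target_names))
# Sizes of the training and test splits
print(X_train.shape, X_test.shape)
```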

3. Define the base model

The next step is to choose the base model. We’ll use the random forest classifier for this example. Note that a random forest is itself an ensemble of decision trees, so this example bags an ensemble; a single decision tree is the more common base model for bagging. The n_estimators parameter dictates the number of trees in the forest, max_depth limits how deep each tree can grow, max_features controls how many features each split considers, and random_state ensures reproducibility. Adjusting these hyperparameters allows fine-tuning the model's performance.

base_model = RandomForestClassifier(n_estimators=10, max_depth=3, max_features='sqrt', random_state=42) # You can adjust hyperparameters
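One way to adjust these hyperparameters is to compare a few candidate values with cross-validation on the training data. The sketch below does this for max_depth; the candidate values (2, 3, 5) and the use of 5-fold cross-validation are assumptions made for illustration, not settings taken from the article.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=10
)

cv_scores = {}
for depth in (2, 3, 5):  # candidate depths, picked only for illustration
    candidate = RandomForestClassifier(
        n_estimators=10, max_depth=depth, max_features="sqrt", random_state=42
    )
    # 5-fold cross-validation on the training split only
    scores = cross_val_score(candidate, X_train, y_train, cv=5)
    cv_scores[depth] = scores.mean()
    print(f"max_depth={depth}: mean CV accuracy = {scores.mean():.3f}")
```

Keeping the test set out of this comparison means the final accuracy reported later is still measured on data the model selection never saw.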

4. Implement bagging

We will now create an instance of BaggingClassifier and fit it on the training data. The first argument, base_model, specifies the underlying model to be used, while n_estimators determines the number of copies of the base model in the ensemble. The random_state parameter ensures reproducibility by setting the seed for random number generation.

bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)
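After fitting, the ensemble can be inspected to see the bootstrap sampling at work. The sketch below repeats the earlier steps so that it runs on its own, and additionally sets oob_score=True, which is not used in the article's code; it asks scikit-learn to estimate accuracy from the training points each model did not see.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=10
)

base_model = RandomForestClassifier(
    n_estimators=10, max_depth=3, max_features="sqrt", random_state=42
)
# oob_score=True is an addition for this sketch (out-of-bag evaluation)
bagging_model = BaggingClassifier(
    base_model, n_estimators=50, oob_score=True, random_state=20
)
bagging_model.fit(X_train, y_train)

# Number of fitted copies of the base model
print(len(bagging_model.estimators_))

# Indices of the training rows drawn for the first model (with repeats)
first_sample = bagging_model.estimators_samples_[0]
print(len(first_sample), len(set(first_sample)))

# Accuracy estimated on the rows left out of each bootstrap sample
print(f"OOB accuracy: {bagging_model.oob_score_:.3f}")
```

The count of distinct indices is smaller than the sample size because drawing with replacement repeats some rows and leaves others out.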

5. Predict and evaluate

Now, we will make predictions on the test set and calculate the accuracy.

y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
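Accuracy alone does not show which kind of mistakes the model makes. As a possible complement, the sketch below computes a confusion matrix with sklearn.metrics.confusion_matrix; this metric is an addition and not part of the article's example. The earlier steps are repeated so that the block runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=10
)

base_model = RandomForestClassifier(
    n_estimators=10, max_depth=3, max_features="sqrt", random_state=42
)
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)

y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

# Rows are true labels, columns are predicted labels,
# both ordered by label value (0, then 1).
cm = confusion_matrix(y_test, y_pred)
print(cm)
print("Accuracy: {:.2f}%".format(accuracy * 100))
```

The diagonal entries count correct predictions, so their sum divided by the total equals the accuracy printed on the last line.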

Example

The following code shows how we can implement the bagging ensemble classifier in Python:

from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
# Load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=10)
# Use RandomForestClassifier with max_features='sqrt' for randomness
base_model = RandomForestClassifier(n_estimators=10, max_depth=3, max_features='sqrt', random_state=42) # We can adjust hyperparameters
# Implement bagging with 50 copies of the base model, each fit on its own bootstrap sample
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)
# Predict and evaluate
y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))

Explanation:

  • Lines 1–4: These lines import the required libraries.

  • Line 6: This line loads the dataset from sklearn and stores it in the cancer variable.

  • Line 7: This line splits the dataset into training and test sets.

  • Line 9: We define a RandomForestClassifier as the base model for bagging.

  • Lines 11–12: Here, we create a BaggingClassifier with 50 base models and fit the bagging model on the training data. The BaggingClassifier handles the bootstrap sampling internally when fitting the model.

  • Line 14: The trained model is used to make predictions on the test data.

  • Lines 15–16: The code calculates the accuracy of the model’s predictions by comparing them to the true labels in the test set. The accuracy is printed as a percentage.
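To see roughly what happens inside a bagging ensemble, the steps can also be written out by hand. The sketch below is a simplified illustration, not scikit-learn's implementation: it assumes shallow decision trees as the base model (instead of the random forest used above), 25 models, and a hard majority vote, and it draws the bootstrap indices with NumPy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.2, random_state=10
)

rng = np.random.default_rng(20)  # arbitrary seed, assumed for this sketch
n_models = 25                    # odd, so a binary majority vote cannot tie
models = []
for _ in range(n_models):
    # Bootstrap sample: row indices drawn with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(X_train[idx], y_train[idx])
    models.append(tree)

# Predictions of every model, shape (n_models, n_test_samples)
all_preds = np.array([m.predict(X_test) for m in models])

# Majority vote for binary labels 0/1: predict 1 when more than half vote 1
votes_for_one = all_preds.sum(axis=0)
y_pred_manual = (votes_for_one > n_models / 2).astype(int)

manual_accuracy = accuracy_score(y_test, y_pred_manual)
print("Manual bagging accuracy: {:.2f}%".format(manual_accuracy * 100))
```

Each tree sees a different bootstrap sample, so the trees disagree on some test points, and the vote smooths out those individual errors.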

Copyright ©2024 Educative, Inc. All rights reserved