Ensemble methods in machine learning leverage the power of combining multiple models to enhance overall performance. This approach is particularly effective when individual models may have limitations or biases. One prominent ensemble technique is Bagging (also known as bootstrap aggregating).
Bagging aims to reduce overfitting and variance: it trains several copies of a base model on bootstrap samples (random samples of the training data drawn with replacement) and aggregates their predictions, typically by majority vote for classification or by averaging for regression.
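To make this idea concrete before turning to scikit-learn's built-in classes, here is a minimal sketch of bagging done by hand. The dataset, tree depth, and number of models below are illustrative choices, not prescribed values: we draw bootstrap samples, train one decision tree per sample, and combine predictions by majority vote.

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=10)

rng = np.random.default_rng(42)
n_models = 25
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw len(X_train) row indices with replacement
    idx = rng.integers(0, len(X_train), size=len(X_train))
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X_train[idx], y_train[idx]))

# Majority vote: average the 0/1 predictions across trees and round
votes = np.mean([tree.predict(X_test) for tree in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)
print("Manual bagging accuracy:", np.mean(y_pred == y_test))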
Follow the steps below to implement the bagging algorithm in Python:
The first step is to import the required libraries, as shown in the code below:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
The next step is to load the dataset. We’ll use the breast cancer dataset provided by the sklearn library. This dataset consists of 30 features. The target variable is the diagnosis, where 0 represents malignant and 1 represents benign tumors. The train_test_split function divides the dataset into training and testing sets.
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=10)
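If you want to verify the dataset’s shape and label encoding before training, a quick optional inspection of the cancer object loaded above looks like this:

import numpy as np

print(cancer.data.shape)           # (569, 30): 569 samples, 30 features
print(cancer.target_names)         # ['malignant' 'benign']: 0 = malignant, 1 = benign
print(np.bincount(cancer.target))  # class counts per label: [212 357]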
The next step is to choose the base model. The bagging classifier trains multiple copies of this model on bootstrap samples and aggregates their predictions by majority vote (or by averaging predicted probabilities when the base model supports them). We’ll use the random forest classifier for this example. The n_estimators parameter dictates the number of trees in the forest, and random_state ensures reproducibility. Adjusting hyperparameters like n_estimators, max_depth, and max_features allows fine-tuning of the model's performance; a brief tuning sketch follows the code below.
base_model = RandomForestClassifier(n_estimators=10, max_depth=3, max_features='sqrt', random_state=42) # You can adjust hyperparameters
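If you’d rather search over these hyperparameters than set them by hand, a small grid search sketch could look like the following. It reuses the train/test split from above, and the grid values are arbitrary choices for illustration:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [3, 5, None],
    "max_features": ["sqrt", "log2"],
}
# 5-fold cross-validated search over the grid, fit on the training data only
search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)  # Best combination found on the training data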
We will now create an instance of BaggingClassifier and fit it on the training data to train the model. The first argument specifies the underlying base model to be used (passed here as base_model), while n_estimators determines the number of base models in the ensemble. The random_state parameter ensures reproducibility by setting the seed for random number generation.
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)
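Because each base model sees only a bootstrap sample, the rows it never saw (the “out-of-bag” samples) provide a free validation estimate. BaggingClassifier exposes this via its oob_score parameter; a sketch of how it could be used with the same base model and split as above:

# Each model is scored on the rows left out of its bootstrap sample,
# so no separate validation split is needed
oob_model = BaggingClassifier(base_model, n_estimators=50, oob_score=True, random_state=20)
oob_model.fit(X_train, y_train)
print("OOB score: {:.2f}%".format(oob_model.oob_score_ * 100))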
Now, we will make predictions on the test set and calculate the accuracy.
y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
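To check whether the ensemble actually helps, it’s worth comparing it against the standalone base model on the same split. A quick check, reusing the variables defined above:

# Fit the base model alone on the same training data for comparison;
# BaggingClassifier clones its estimator, so this doesn't affect the ensemble
base_model.fit(X_train, y_train)
base_accuracy = accuracy_score(y_test, base_model.predict(X_test))
print("Base model accuracy: {:.2f}%".format(base_accuracy * 100))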
The following code shows how we can implement the bagging ensemble classifier in Python:
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score

# Load and split the data
cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(cancer.data, cancer.target, test_size=0.2, random_state=10)

# Define the base model; hyperparameters can be adjusted
base_model = RandomForestClassifier(n_estimators=10, max_depth=3, max_features='sqrt', random_state=42)

# Create the bagging ensemble and fit it on the training data
bagging_model = BaggingClassifier(base_model, n_estimators=50, random_state=20)
bagging_model.fit(X_train, y_train)

# Predict and evaluate
y_pred = bagging_model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}%".format(accuracy * 100))
Lines 1–4: These lines import the required libraries.
Line 7: This line loads the dataset from sklearn and stores it in the cancer variable.
Line 8: This line splits the dataset into training and testing sets.
Line 11: We define a RandomForestClassifier as the base model for bagging.
Lines 14–15: Here, we create a BaggingClassifier with 50 base models and fit the bagging model on the training data. The BaggingClassifier handles the bootstrap sampling internally when fitting the model; a sketch showing how this sampling can be adjusted appears after this list.
Line 18: The trained model is used to make predictions on the test data.
Lines 19–20: The code calculates the accuracy of the model’s predictions by comparing them to the true labels in the test set. The accuracy is printed as a percentage.
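Although the bootstrap sampling happens internally, BaggingClassifier also exposes parameters to control it. A brief sketch of the relevant options, with illustrative values rather than recommendations:

bagging_model = BaggingClassifier(
    base_model,
    n_estimators=50,
    max_samples=0.8,           # each base model trains on a random 80% of the rows
    bootstrap=True,            # draw rows with replacement (the default)
    bootstrap_features=False,  # keep all 30 columns for every model (the default)
    random_state=20,
)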