
Bagging vs. Boosting in machine learning

Saif Ali
Jun 24, 2024
10 min read

Machine learning (ML) can be tricky, so practitioners explore different techniques to refine their models. Bagging and Boosting are two such ensemble methods that have shown remarkable efficacy. Let's learn more about the differences and applications of Bagging and Boosting.

Introduction to ensemble methods#

Ensemble methods in machine learning are strategies that combine the predictions or decisions of multiple models to improve the overall predictive performance compared to using a single model. By leveraging the diversity and strengths of various base models, ensemble methods can often reduce both bias and variance, resulting in more robust and accurate predictions. These methods can be applied to various machine learning tasks, including the following:

  • Classification: Assigns input data to predefined categories or classes based on patterns in the data.

  • Regression: Predicts a continuous numerical outcome based on input data.

  • Anomaly detection: Identifies and flags unusual or abnormal data points within a dataset.

A high-level view of how ensemble methods work

Common ensemble methods include Bagging, Boosting, and Stacking (a meta-ensemble method that combines the predictions of multiple base models using a higher-level model to produce a final prediction), each with its own approach for combining base models to create a stronger and more reliable ensemble model. Ensemble methods have gained popularity in the field of machine learning due to their ability to enhance predictive accuracy and generalization across a wide range of applications.
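To make the idea concrete, here is a minimal sketch (not part of the original walkthroughs; the dataset and base models are chosen only for illustration) that combines three different classifiers with scikit-learn's VotingClassifier and lets them vote on each prediction:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import VotingClassifier

# Toy setup: the same Breast Cancer dataset used later in this article
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Three different base models vote on each prediction (hard majority voting)
ensemble = VotingClassifier(
    estimators=[
        ('lr', LogisticRegression(max_iter=5000)),
        ('dt', DecisionTreeClassifier(random_state=42)),
        ('knn', KNeighborsClassifier()),
    ],
    voting='hard',
)
ensemble.fit(X_train, y_train)
print('Ensemble accuracy:', ensemble.score(X_test, y_test))

Bagging and Boosting go further than this simple vote by deliberately introducing diversity among the base models, as the following sections show.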

Bagging: bootstrapped aggregation#

In a world overflowing with data, Bagging, also known as bootstrapped aggregation, offers a systematic way to turn that data's variability to our advantage.

What is Bagging?#

Bagging is a machine learning ensemble method that aims to reduce the variance of a model by averaging the predictions of multiple base models. The key idea behind Bagging is to create multiple subsets of the training data (bootstrap samples) and train a separate base model on each of these subsets. These base models can be of any type, such as decision trees, neural networks, or regression models. Once the base models are trained, Bagging combines their predictions by averaging (for regression tasks) or voting (for classification tasks) to make the final prediction. The most popular Bagging algorithm is the Random Forest, which uses Decision Trees as base models.
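For a quick taste before the detailed walkthrough, here is a minimal sketch (an illustrative assumption, not the article's own implementation) using scikit-learn's RandomForestClassifier, which bags decision trees out of the box:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 decision trees, each fit on a bootstrap sample of the training data;
# their votes are aggregated to form the final prediction
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print('Random Forest accuracy:', forest.score(X_test, y_test))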

In the figure below, we highlight the key features of Bagging in machine learning:

Exploring the features of the Bagging ensemble method

[Slide deck: the pivotal aspects of Bagging in machine learning]

How does bagging work?#

Bagging’s primary objective is to reduce variance by leveraging the power of multiple models. Let's examine its inner workings:

  • Data sampling: Start with a training dataset of size N and draw a bootstrap sample of the same size by sampling with replacement.

  • Model training: Train a unique model on each bootstrapped subset. Each model will differ due to variances in the subset.

  • Repeat the process: Repeat the above steps M times.

  • Aggregation of results: Consolidate the outputs from all models.

  • Prediction for new data: Every model predicts new data points. Finalize the prediction via majority vote (classification) or averaging (regression).

[Slide deck: a step-by-step Bagging example]
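For a concrete feel in code, here is a minimal sketch of the steps above using scikit-learn's built-in BaggingClassifier (an illustrative assumption, not the article's implementation; the next section builds the same pipeline by hand):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# M = 10 decision trees, each trained on a bootstrap sample drawn with replacement;
# the final prediction is a majority vote across the 10 trees
bagger = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, bootstrap=True, random_state=42)
bagger.fit(X_train, y_train)
print('Bagging accuracy:', bagger.score(X_test, y_test))

Here, n_estimators plays the role of M in the steps above.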

Bagging: practical implementation in Python#

We'll walk through a hands-on implementation of Bagging using Python's scikit-learn library, focusing on the Breast Cancer dataset. Prepare your coding environment, and let's dive in.

Step 1: Import libraries#

Importing the required libraries before proceeding with any machine learning project is essential. This gives us the tools to process data, visualize results, and implement algorithms.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
import seaborn as sns
Step 2: Load and split the dataset#

We need to load our dataset before we can train our models. For this example, we're using the Breast Cancer dataset available in scikit-learn. We then split this data into training, validation, and testing sets.

data = load_breast_cancer()
X = data.data
y = data.target
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_validation, X_test, y_validation, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
Step 3: Define ensemble training methods#

Bagging involves training multiple instances of the same model on different subsamples of the dataset. Here, we've defined functions to:

  • Draw bootstrap samples from our data.

  • Train a model on a subset of our data.

  • Create an ensemble of models.

  • Use the ensemble to make predictions.

def bootstrap_sample(data, labels, size):
    # Draw `size` indices with replacement to create a bootstrap sample
    indices = np.random.choice(len(data), size=size, replace=True)
    return data[indices], labels[indices]

def train_model_on_subset(data, labels):
    # Train one decision tree on a bootstrap sample of the training data
    subset_data, subset_labels = bootstrap_sample(data, labels, size=len(data))
    model = DecisionTreeClassifier()
    model.fit(subset_data, subset_labels)
    return model

def create_ensemble(data, labels, num_models):
    # Build an ensemble of `num_models` independently trained trees
    models = []
    for _ in range(num_models):
        models.append(train_model_on_subset(data, labels))
    return models

def ensemble_predict(models, data_point):
    # Combine the trees' predictions for one data point by majority vote
    predictions = [model.predict([data_point])[0] for model in models]
    return np.bincount(predictions).argmax()
Step 4: Train the model and create training-validation curves#

Training and validation curves provide insights into how well our model is performing. They can help diagnose issues like underfitting and overfitting.

train_scores = []
val_scores = []
for _ in range(10):  # Train and evaluate 10 times
    models = create_ensemble(X_train, y_train, num_models=10)
    train_preds = [ensemble_predict(models, data_point) for data_point in X_train]
    val_preds = [ensemble_predict(models, data_point) for data_point in X_validation]
    train_scores.append(accuracy_score(y_train, train_preds))
    val_scores.append(accuracy_score(y_validation, val_preds))

# Plot
plt.figure(figsize=(10, 6))
plt.plot(train_scores, label='Training Accuracy', marker='o')
plt.plot(val_scores, label='Validation Accuracy', marker='o')
plt.xlabel('Bootstrap Iteration')
plt.ylabel('Accuracy')
plt.title('Training vs. Validation Accuracy')
plt.legend()
plt.grid(True)
plt.show()

The training-validation plot generated through the above code is as follows:

Training-validation accuracy plot
Step 5: Display the confusion matrix#

A confusion matrix provides a visual representation of our model’s performance, showing where it made correct predictions and where it made errors.

predictions = [ensemble_predict(models, data_point) for data_point in X_test]
conf_matrix = confusion_matrix(y_test, predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues',
            xticklabels=data.target_names,
            yticklabels=data.target_names)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

The confusion matrix generated through the above code is as follows:

Confusion matrix
Step 6: Print evaluation metrics#

Lastly, we'll use classification_report to provide a comprehensive breakdown of our model's performance.

print(classification_report(y_test, predictions, target_names=data.target_names))

The output of the above code is as follows:

Classification report

Bagging offers an intelligent strategy to create robust models by leveraging the power of multiple "mini" models. The Python walkthrough above gives us a glimpse into its implementation on the Breast Cancer dataset, a stepping stone to more intricate real-world scenarios.

Boosting: A sequential improvement#

When we talk about Boosting, imagine an artist meticulously fixing each mistake one by one to make their work perfect.

What is boosting?#

Boosting is another ensemble learning method that focuses on improving the accuracy of a model by sequentially training a series of base models. Unlike Bagging, where base models are trained independently, Boosting trains each base model in a way that emphasizes the examples that the previous models misclassified. The idea is to give more weight to the misclassified samples so that the subsequent models focus on these challenging cases. The final prediction is then made by combining the predictions of all base models, giving more weight to those that performed better during training. Popular Boosting algorithms include AdaBoost, Gradient Boosting, and XGBoost.
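As a minimal sketch (an illustrative assumption, not the walkthrough below, which uses AdaBoost), scikit-learn's GradientBoostingClassifier adds shallow trees one at a time, each trained to correct the errors of the ensemble built so far:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 100 shallow trees added sequentially; learning_rate scales each tree's contribution
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gbm.fit(X_train, y_train)
print('Gradient Boosting accuracy:', gbm.score(X_test, y_test))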

In the figure below, we highlight the key features of Boosting in machine learning:

Exploring the features of the Boosting ensemble method

[Slide deck: the pivotal aspects of Boosting in machine learning]

How does Boosting work?#

Let’s explore how Boosting works:

  • Initialization: Start with all training samples having equal weights.

  • Training weak learners: Train a model (usually a small decision tree). This model doesn’t need to be perfect; it just needs to be better than a random guess.

  • Compute errors: Identify misclassified samples. Calculate the error rate based on the weights of these misclassified samples.

  • Determine model importance: Assign the model an “importance score” using the error rate. This score tells us how much to trust this model’s predictions.

  • Update sample weights: Increase weights for misclassified samples. Decrease weights for correctly classified ones. This ensures the next model focuses more on the mistakes of the previous one.

  • Iterate: Repeat the process, training new models on the reweighted samples.

  • Combine models for prediction: For final predictions, combine the outputs of all models. Each model’s prediction is weighted by its importance score.

[Slide deck: a step-by-step Boosting example]
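To make the weight arithmetic concrete, here is a tiny hand-worked iteration in the AdaBoost style; all numbers are made up purely for illustration, and labels are encoded as +1/-1:

import numpy as np

weights = np.full(5, 0.2)                              # start with equal weights on 5 samples
y_true = np.array([ 1, -1,  1,  1, -1])
y_pred = np.array([ 1,  1,  1, -1, -1])                # a weak learner misclassifies samples 2 and 4
incorrect = (y_pred != y_true)
error = np.sum(weights * incorrect)                    # weighted error = 0.4
alpha = 0.5 * np.log((1 - error) / error)              # importance score ~ 0.203
weights *= np.exp(alpha * np.where(incorrect, 1, -1))  # up-weight mistakes, down-weight correct ones
weights /= weights.sum()                               # renormalize so the weights sum to 1
print(error, alpha, weights)

After the update, the two misclassified samples carry a weight of 0.25 each while the correctly classified ones drop to about 0.167, so the next weak learner pays more attention to the earlier mistakes.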

Boosting: Practical implementation in Python#

We’ll walk through a hands-on implementation of Boosting using Python’s scikit-learn library, focusing on the Breast Cancer dataset. Prepare your coding environment, and let’s dive in!

Step 1: Import libraries#

Importing the required libraries before proceeding with any machine learning project is essential. This gives us the tools to process data, visualize results, and implement algorithms.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
import seaborn as sns
Step 2: Load and split the dataset#

We need to load our dataset before we can train our models. For this example, we’re using the Breast Cancer dataset available in scikit-learn. We then split this data into training, validation, and testing sets.

data = load_breast_cancer()
X = data.data
y = data.target
# Split the data into training, validation, and testing sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42)
X_validation, X_test, y_validation, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
Step 3: Train the AdaBoost classifier#

In this step, we initialize sample weights and walk through one round of training a weak learner by hand to see the mechanics, then hand the full set of iterations over to scikit-learn's AdaBoostClassifier.

# Initialize weights
weights = np.ones(len(X_train)) / len(X_train)
# Training a weak learner
weak_learner = DecisionTreeClassifier(max_depth=1)
weak_learner.fit(X_train, y_train, sample_weight=weights)
# Predictions and Errors
predictions = weak_learner.predict(X_train)
incorrect = (predictions != y_train)
# Weighted error
error = np.dot(weights, incorrect) / np.sum(weights)
# Calculate the model's importance (higher when the weighted error is lower)
alpha = 0.5 * np.log((1 - error) / error)
# Update weights: increase the weights of the misclassified samples
weights *= np.exp(alpha * incorrect * ((weights > 0) | (alpha < 0)))
# Use AdaBoost for the iterations
clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=50)
clf.fit(X_train, y_train)
AdaBoostClassifier with a maximum depth of 1
Step 4: Calculate and plot training and validation accuracies#

Monitoring the model’s performance on both the training and validation data provides insight into its learning curve. Here, we gather the accuracies at each Boosting iteration.

train_scores = []
val_scores = []
for stage in clf.staged_predict(X_train):
    train_scores.append(accuracy_score(y_train, stage))
for stage in clf.staged_predict(X_validation):
    val_scores.append(accuracy_score(y_validation, stage))

# Plotting training vs validation scores using line plot
plt.figure(figsize=(10, 6))
plt.plot(train_scores, label='Training Accuracy', marker='o')
plt.plot(val_scores, label='Validation Accuracy', marker='o')
plt.xlabel('Boosting Iteration')
plt.ylabel('Accuracy')
plt.title('Training vs. Validation Accuracy')
plt.legend()
plt.grid(True)
plt.show()

The train-validation plot of the above code is as follows:

Training-validation accuracy plot
Step 5: Confusion matrix#

To better understand where our model might misclassify data, we visualize its performance using a confusion matrix.

# Confusion Matrix
final_predictions = clf.predict(X_test)
conf_matrix = confusion_matrix(y_test, final_predictions)
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues',
            xticklabels=data.target_names,
            yticklabels=data.target_names)
plt.xlabel('Predicted Labels')
plt.ylabel('True Labels')
plt.title('Confusion Matrix')
plt.show()

The confusion matrix generated through the above code is as follows:

Confusion matrix
Step 6: Classification report#

Lastly, we’ll use classification_report to provide a comprehensive breakdown of our model’s performance.

# Classification Report
print(classification_report(y_test, final_predictions, target_names=data.target_names))

The output of the above code is as follows:

Classification report

The implementation above gives an idea of how AdaBoost, one kind of Boosting, works. It simplifies many details for clarity, but it provides a foundation for building and exploring more sophisticated Boosting methods.


Comparing Bagging and Boosting#

Bagging and Boosting are both ensemble methods used to improve the performance of machine learning models, but they have distinct approaches and characteristics. The table below compares them side by side:

| Characteristic | Bagging | Boosting |
|---|---|---|
| Primary objective | Reduce variance | Reduce bias and variance |
| Model independence | Models are independent and can be trained in parallel | Models depend on the errors of the previous ones and are trained sequentially |
| Sampling technique | Bootstrapping (random sampling with replacement) | Weighted sampling based on previous errors |
| Weight update | Weights of data points are not adjusted | Weights of misclassified points are increased |
| Combination method | Averages predictions (for regression) or takes a majority vote (for classification) | Weighs model predictions based on their accuracy, then averages (for regression) or takes a weighted vote (for classification) |
| Risk of overfitting | Lower, thanks to averaging out individual model errors | Higher, especially with a large number of weak learners |
| Typical algorithms | Bagged decision trees, Random Forest | AdaBoost, Gradient Boosting, XGBoost |
| Speed | Typically faster because models can be trained in parallel | Slower due to the sequential nature of model training |

Which one to choose?#

Choosing between Bagging and Boosting depends on various factors, including the nature of the data, the primary problem being faced (e.g., overfitting vs. underfitting), and specific performance metrics of interest. While both methods can enhance the performance of machine learning algorithms, they serve different primary objectives and possess unique characteristics.

Making the right choice often requires experimentation and a deep understanding of the underlying data and problem. Below is a table that gives guidance on when to opt for one method over the other based on certain scenarios or requirements:

| Scenario | Bagging | Boosting |
|---|---|---|
| Problem with high variance | Preferred, because Bagging aims to reduce variance by averaging predictions | Can be used, but the primary objective is to reduce bias and variance |
| Problem with high bias | Might not be as effective, since the primary focus is on reducing variance | Preferred, because Boosting specifically targets reducing bias through sequential improvements |
| Overfitting concerns | Safer choice; tends to reduce overfitting due to its averaging nature | Could lead to overfitting, especially with too many iterations or weak learners |
| Need for model interpretability | Generally less interpretable due to multiple models (except when using simple models like decision trees) | Sequential nature can make it harder to interpret, especially with many weak learners |
| Computational efficiency | Often faster, since models can be trained in parallel | Typically slower, because models are trained sequentially based on previous errors |
| Larger datasets | More suitable, especially with techniques like Random Forest, which handles large datasets well | Might be computationally intensive with larger datasets due to sequential training |
| Desire for model diversity | Achieves diversity through bootstrapped samples | Achieves diversity by focusing on the previous model's errors |

It's essential to remember that the theoretical guidance provided in the table is a starting point. Practical model selection should always involve experimentation on the specific dataset in question. Different datasets or slight changes in problem definitions might lead to unexpected outcomes. Therefore, it's beneficial to try both methods and compare their performances on a validation set before finalizing a decision.

Next steps#

If you want to expand your knowledge and explore machine learning further, the following courses are an excellent starting point:

Mastering Machine Learning Theory and Practice

Mastering Machine Learning Theory and Practice

The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.

36hrs
Beginner
109 Playgrounds
10 Quizzes

An Introductory Guide to Data Science and Machine Learning

An Introductory Guide to Data Science and Machine Learning

There is a lot of dispersed and somewhat conflicting information on the internet when it comes to data science, making it tough to know where to start. Don't worry. This course will get you familiar with the state of data science and related fields such as machine learning and big data. You will go through the fundamental concepts and libraries that are essential for solving any problem in this field. You will work on real-time projects from Kaggle while also honing the mathematical skills that you will use extensively in most problems you face. You will also be taken through a systematic approach covering everything from data acquisition to data wrangling. This is your all-in-one guide to becoming a confident data scientist.

6hrs
Beginner
63 Playgrounds
160 Illustrations

Data Science Projects with Python

Data Science Projects with Python

As businesses gather vast amounts of data, machine learning is becoming an increasingly valuable tool for utilizing data to deliver cutting-edge predictive models that support informed decision-making. In this course, you will work on a data science project with a realistic dataset to create actionable insights for a business. You’ll begin by exploring the dataset and cleaning it using pandas. Next, you will learn to build and evaluate logistic regression classification models using scikit-learn. You will explore the bias-variance trade-off by examining how the logistic regression model can be extended to address the overfitting problem. Then, you will train and visualize decision tree models. You'll learn about gradient boosting and understand how SHAP values can be used to explain model predictions. Finally, you’ll learn to deliver a model to the client and monitor it after deployment. By the end of the course, you will have a deep understanding of how data science can deliver real value to businesses.

24hrs
Beginner
52 Playgrounds
7 Quizzes

Frequently Asked Questions

What’s the difference between bagging and boosting?

Bagging (Bootstrap Aggregating) and boosting are both ensemble learning techniques that combine multiple models to improve performance, but they differ in approach. Bagging involves training multiple models independently using different random subsets of the training data and then averaging their predictions to reduce variance and prevent overfitting. In contrast, boosting sequentially trains models, where each model focuses on correcting the errors of its predecessor, by giving more weight to misclassified instances. This iterative process aims to reduce bias and improve the model’s accuracy.



  
