Simulating Unbiased Mislabeling Using Python Programming

Learn how to simulate unbiased mislabeling in the MNIST digit dataset using Python.

We'll cover the following...

Step 1: Visualizing the MNIST digit dataset
- Visualizing images of the MNIST dataset
Step 2: Simulating unbiased mislabeling in the MNIST dataset
- Simulating 10% unbiased mislabeling
Step 3: Visualizing the dataset after simulating unbiased mislabeling
- Expected output
- Code explanation
Summary

The main objective of this lesson is to simulate unbiased mislabeling noise in a dataset and to visualize its impact. The lesson is structured into the following three steps:

Step 1: We’ll examine the MNIST digit dataset and analyze its characteristics in order to understand the dataset thoroughly before introducing mislabeling.
Step 2: We’ll simulate unbiased mislabeling in the MNIST dataset by intentionally changing the labels of some data points, which reproduces the effect of label noise on the dataset.
Step 3: We’ll create visualizations that show how many examples of each digit in the MNIST dataset were affected, so that we can observe the effect of unbiased mislabeling on the dataset.
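As a preview of Step 2, the idea of unbiased mislabeling can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the lesson's own implementation: it assumes that "unbiased" means each selected example is reassigned, uniformly at random, to one of the other classes. The function name `simulate_unbiased_mislabeling`, its parameters, and the synthetic label array are all hypothetical and stand in for the real MNIST labels.

```python
import numpy as np

def simulate_unbiased_mislabeling(labels, noise_rate=0.10, num_classes=10, seed=0):
    """Return a noisy copy of `labels` plus the indices that were flipped.

    Assumption: a fraction `noise_rate` of the examples is picked at random,
    and each picked label is replaced by a different class chosen uniformly.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()

    # Choose which examples to mislabel, without repetition
    n_flip = int(round(noise_rate * len(labels)))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)

    # An offset in 1..num_classes-1 (mod num_classes) always lands on a
    # different class, and every other class is equally likely
    offsets = rng.integers(1, num_classes, size=n_flip)
    noisy[flip_idx] = (labels[flip_idx] + offsets) % num_classes
    return noisy, flip_idx

# Synthetic stand-in for MNIST labels: 100 examples of each digit
clean = np.repeat(np.arange(10), 100)
noisy, flipped = simulate_unbiased_mislabeling(clean, noise_rate=0.10)
print("Fraction of changed labels:", (noisy != clean).mean())
```

Because the offset is never zero, every selected example really does end up with a wrong label, so the observed noise rate matches the requested one.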

Step 1: Visualizing the MNIST digit dataset

We use the MNIST digit dataset, which contains 60,000 training images and 10,000 test images of handwritten digits, to observe the impact of unbiased mislabeling on image classification performance. The code below plots the training set as a bar chart. Each bar represents one digit class (0 through 9), and the number of training examples for that digit is displayed on top of the bar. The digit labels appear along the x-axis below the bars. This visualization shows how the training examples are distributed across the ten digits.

Run the code below to visualize the number of training examples for each digit in the MNIST dataset.

Python 3.10.4

import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '2'  # Hide TensorFlow info and warning logs
from keras.datasets import mnist  # Importing the MNIST digit dataset
import matplotlib.pyplot as plt  # Importing the data visualization library

# Loading the MNIST dataset
(train_X, train_y), (test_X, test_y) = mnist.load_data()

# Counting the number of instances of each digit in the training set
digit_counts = [0] * 10
for label in train_y:
    digit_counts[label] += 1

# Plotting the number of training examples of each digit
figure, axis = plt.subplots()
bars = axis.bar(range(10), digit_counts)
axis.set_xticks(range(10))  # Show a tick label for every digit, 0 through 9
axis.set_xlabel("Digits")
axis.set_ylabel("Counts")
axis.set_title("Number of Training Examples for Each Digit in MNIST Dataset")

# Adding the count labels on top of the bars
for bar, count in zip(bars, digit_counts):
    height = bar.get_height()
    axis.text(bar.get_x() + bar.get_width() / 2, height, str(count),
              ha='center', va='bottom')
plt.show()
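The counting loop in the code above can also be written with NumPy's `np.bincount`, which counts the occurrences of each non-negative integer in an array. The sketch below is an optional alternative, not part of the lesson's code. It uses a small synthetic label array (a hypothetical stand-in for `train_y`) so that it runs without downloading MNIST, and it checks that both approaches give the same counts.

```python
import numpy as np

# Synthetic stand-in for train_y: 1,000 random digit labels stored as uint8,
# the same dtype that the MNIST labels use
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000).astype(np.uint8)

# Vectorized count; minlength=10 keeps a slot for every digit,
# even one that never appears in the array
counts = np.bincount(labels, minlength=10)

# The explicit loop from the lesson, for comparison
loop_counts = [0] * 10
for label in labels:
    loop_counts[label] += 1

print("Counts match:", counts.tolist() == loop_counts)
print("Share of each digit (%):", np.round(100 * counts / counts.sum(), 1))
```

With the real dataset, `np.bincount(train_y, minlength=10)` could replace the loop directly, and the resulting array can be passed to `axis.bar` in the same way as `digit_counts`.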
