Unbiased Mislabeling in Image Classification Using CNNs

Explore how an unbiased mislabeled dataset affects the performance of a CNN model.

In this lesson, we’ll learn about the impact of a small amount of unbiased mislabeling in a dataset. We aim to understand the consequences of poor-quality data by using a CNN model with two versions of the dataset—one with a clean dataset and the other with a mislabeled dataset. We’ll then compare the performance using the accuracy metric in order to gauge the impact of mislabeling.

Implementing unbiased mislabeling

To assess the impact of the dataset on the performance of a CNN model, we’ll take several steps to compare the results between a clean and mislabeled dataset.

Step 1: Importing libraries

The following code imports the libraries necessary to implement unbiased mislabeling:

# Import the libraries needed to build and train the CNN
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from tensorflow.keras.optimizers import Adam

Step 2: Loading and creating an unbiased mislabeled dataset

The code below loads the MNIST digit dataset using the Keras library. We assume this dataset is clean, meaning the label assigned to each image is correct. We then create a second version of the training set in which 10% of the images, chosen uniformly at random, are given incorrect labels. Because the corrupted images are selected at random rather than from particular classes, the mislabeling is unbiased, and comparing the two versions will show how even a small amount of such label noise affects the model's performance.

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()
# Define the percentages for training and testing data
train_percentage = 0.25 # 15,000 images for training
test_percentage = 0.2 # 2,000 images for testing
# Calculate the number of samples based on percentages
total_train_samples = len(x_train)
total_test_samples = len(x_test)
train_samples = int(train_percentage * total_train_samples)
test_samples = int(test_percentage * total_test_samples)
# Distribute the data based on percentages
x_train = x_train[:train_samples]
y_train = y_train[:train_samples]
x_test = x_test[:test_samples]
y_test = y_test[:test_samples]
# Define the percentage of mislabeled images
mislabel_percentage = 10
# Compute the number of images to mislabel
num_mislabeled = int(len(y_train) * mislabel_percentage / 100)
# Randomly select images to mislabel
index = np.random.choice(len(y_train), size=num_mislabeled, replace=False)
# Generate new labels that are guaranteed to differ from the originals
# (shifting a digit by 1-9 modulo 10 can never reproduce the original digit)
new_labels = (y_train[index] + np.random.randint(1, 10, size=num_mislabeled)) % 10
# Copy the original training set and overwrite the selected labels;
# the images themselves are left unchanged
x_train_mislabeled = np.copy(x_train)
y_train_mislabeled = np.copy(y_train)
y_train_mislabeled[index] = new_labels
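Before training on the corrupted labels, it's worth sanity-checking that exactly the intended fraction of labels was changed. The sketch below reproduces the mislabeling logic on a small synthetic label array (a stand-in for `y_train`, assuming 10 classes) so it runs without downloading MNIST; the variable names mirror those above but the data is hypothetical.

```python
import numpy as np

# Synthetic stand-in for y_train: 1,000 labels drawn from 10 classes
rng = np.random.default_rng(0)
y_train = rng.integers(0, 10, size=1000)

mislabel_percentage = 10
num_mislabeled = int(len(y_train) * mislabel_percentage / 100)

# Randomly select distinct indices to corrupt
index = rng.choice(len(y_train), size=num_mislabeled, replace=False)

# Shift each selected label by 1-9 (mod 10) so it always differs from the original
new_labels = (y_train[index] + rng.integers(1, 10, size=num_mislabeled)) % 10

y_mislabeled = y_train.copy()
y_mislabeled[index] = new_labels

# Fraction of labels that actually changed should be exactly 10%
frac = np.mean(y_mislabeled != y_train)
print(frac)  # 0.1
```

Because the replacement label is constructed by a nonzero shift modulo 10, every selected example is genuinely mislabeled; drawing a fresh random label instead would leave roughly one in ten "mislabeled" examples with its original label by chance.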
