...
Unbiased Mislabeling in Image Classification Using CNNs
Explore how an unbiased mislabeled dataset affects the performance of a CNN model.
In this lesson, we’ll examine the impact of a small amount of unbiased mislabeling in a dataset. To understand the consequences of poor-quality data, we’ll train a CNN model on two versions of the dataset: one clean and one mislabeled. We’ll then compare the accuracy of the two resulting models to gauge the impact of mislabeling.
Implementing unbiased mislabeling
To assess the impact of label quality on the performance of a CNN model, we’ll take several steps to compare the results between a clean and a mislabeled dataset.
Step 1: Importing libraries
The following code imports the libraries necessary to implement unbiased mislabeling:
# Import necessary libraries
import keras
import numpy as np
from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
import matplotlib.pyplot as plt
from tensorflow.keras.optimizers import Adam
Step 2: Loading and creating an unbiased mislabeled dataset
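The code given below loads the MNIST digit dataset using the Keras library and keeps a subset of it: 25% of the training images and 20% of the test images. We assume that the dataset is clean, which means that the label given to each image is correct. Then, we create a second copy of the training set in which 10% of the images, selected uniformly at random across all classes, receive a randomly drawn label. Because neither the selected images nor the replacement labels favor any class, the mislabeling is unbiased. This will help us understand the impact of just a small amount of unbiased mislabeling on the model's performance.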
# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Define the percentages for training and testing data
train_percentage = 0.25  # 15,000 images for training
test_percentage = 0.2  # 2,000 images for testing

# Calculate the number of samples based on percentages
total_train_samples = len(x_train)
total_test_samples = len(x_test)
train_samples = int(train_percentage * total_train_samples)
test_samples = int(test_percentage * total_test_samples)

# Keep only the first train_samples / test_samples examples
x_train = x_train[:train_samples]
y_train = y_train[:train_samples]
x_test = x_test[:test_samples]
y_test = y_test[:test_samples]

# Define the percentage of mislabeled images
mislabel_percentage = 10

# Compute the number of images to mislabel
num_mislabeled = int(len(y_train) * mislabel_percentage / 100)

# Randomly select images to mislabel (without repetition)
index = np.random.choice(len(y_train), size=num_mislabeled, replace=False)

# Generate new labels for the selected images.
# The upper bound of np.random.randint is exclusive, so 10 is needed to cover digits 0-9.
new_labels = np.random.randint(0, 10, size=num_mislabeled)

# Copy the original training set; the images stay unchanged and only the selected labels are replaced
x_train_mislabeled = np.copy(x_train)
y_train_mislabeled = np.copy(y_train)
y_train_mislabeled[index] = new_labels
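A replacement label drawn at random can coincide with the true label, so the share of labels that actually change can end up slightly below the nominal 10%. The following is a minimal sketch of one way to guarantee that every selected label changes, and to measure the resulting noise rate. It isn't part of the lesson's code: the helper name `flip_labels`, its parameters, and the randomly generated stand-in labels are assumptions made for illustration, and it uses only NumPy so it can run without downloading MNIST.

```python
import numpy as np


def flip_labels(labels, fraction, num_classes=10, seed=0):
    """Return a copy of `labels` in which `fraction` of the entries
    are changed to a different class, plus the changed positions.

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    noisy = np.copy(labels)
    num_flipped = int(len(labels) * fraction)

    # Pick distinct positions uniformly at random
    idx = rng.choice(len(labels), size=num_flipped, replace=False)

    # Adding an offset in 1..num_classes-1 (modulo num_classes) can never
    # reproduce the original label, and each wrong class is equally likely
    offsets = rng.integers(1, num_classes, size=num_flipped)
    noisy[idx] = (labels[idx] + offsets) % num_classes
    return noisy, idx


# Stand-in labels (assumption): 15,000 random digits instead of MNIST labels
labels = np.random.default_rng(1).integers(0, 10, size=15000)

noisy, idx = flip_labels(labels, fraction=0.10)

# Fraction of labels that differ from the originals
noise_rate = np.mean(noisy != labels)
print(f"Observed noise rate: {noise_rate:.3f}")
```

With this approach, the observed noise rate equals the requested fraction exactly, because every selected position receives a label different from its original one. To try it on the lesson's data, `y_train` could be passed in place of the stand-in `labels`.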
...