Imbalanced Datasets and Techniques to Handle Them

Learn what class imbalance is and how to deal with it, and get an overview of the data before moving on.


Class imbalance is a common problem in classification datasets: the number of observations is not the same across all the classes in the target column. Small differences are usually not a problem. However, some datasets have an extreme class imbalance. For example:

Disease screening: Suppose we are given a dataset to develop a machine learning model that screens patients for COVID-19. Only five out of every 100 observations are COVID-19 positive, against 95 negative. With 1,000 observations, that would be 50 positive and 950 negative cases.

Suppose we train our model on this COVID-19 dataset, and we are happy to see the classifier’s accuracy above 95% with minimal effort. Can we trust a model trained on a dataset with a class distribution of 5:95? Let’s think about it. A model that labels every patient as negative is already right 95% of the time, so the baseline accuracy is 95%. This is the accuracy paradox: the number reflects the underlying class distribution in the imbalanced dataset rather than the model’s ability to find positive cases.
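To make the paradox concrete, here is a minimal sketch in plain Python. The labels are made up to match the 5:95 split (they are not real patient data), and the helper functions accuracy and recall are hypothetical names written for this illustration. The "model" simply predicts negative for everyone.

```python
# Made-up labels matching the 5:95 split: 1 = positive, 0 = negative.
y_true = [1] * 5 + [0] * 95

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive cases that the model found."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy: {accuracy(y_true, y_pred):.2f}")  # accuracy: 0.95
print(f"recall:   {recall(y_true, y_pred):.2f}")    # recall:   0.00
```

The accuracy looks excellent, yet the model does not detect a single positive case, which is exactly what a screening model is supposed to do.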

Fraud detection: Another common example, where only a small fraction of the cases are fraudulent compared with the legitimate ones. The ratio is sometimes as extreme as 1:1000 or 1:5000.
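As a rough sketch of how extreme these ratios are, the snippet below computes the accuracy of a classifier that always predicts the majority class. It assumes a ratio of 1:1000 means one fraudulent case for every 1,000 legitimate ones, and majority_baseline_accuracy is a hypothetical helper written for this illustration.

```python
def majority_baseline_accuracy(minority, majority):
    """Accuracy of always predicting the majority class for a minority:majority ratio."""
    return majority / (minority + majority)


for minority, majority in [(1, 1000), (1, 5000)]:
    acc = majority_baseline_accuracy(minority, majority)
    print(f"{minority}:{majority} -> {acc:.4%}")
# 1:1000 -> 99.9001%
# 1:5000 -> 99.9800%
```

Under this assumption, a model that never flags fraud would still score above 99.9% accuracy, so accuracy alone says almost nothing about such a model.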

Class imbalance in the dataset can produce misleading results and needs to be treated. The following options present ways to handle this issue.
