Undersampling with NearMiss

Learn how to effectively balance the training data by undersampling the majority class.

Real-world entity resolution tasks are severely imbalanced classification problems, which makes them hard to learn from. In smaller datasets, we face ratios of one match per thousands of non-matches, and in medium- to large-scale datasets, the ratio is several orders of magnitude worse. Indexing techniques can reduce the imbalance to some extent, but they do not eliminate it.

Let’s explore how undersampling can improve the situation, using a precomputed dataset of similarity features that also holds the ground truth in the class column.
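To make the idea concrete, here is a minimal sketch of NearMiss (version 1) in plain Python. It keeps only those majority-class rows whose average distance to their k nearest minority-class rows is smallest. The function name `near_miss_1`, its parameters, and the toy similarity rows are all hypothetical and stand in for the lesson's dataset; in practice, the imbalanced-learn library ships a ready-made `NearMiss` class that we would likely use instead.

```python
import math


def near_miss_1(features, labels, majority=0, minority=1, k=3, ratio=1.0):
    """Undersample the majority class in the spirit of NearMiss-1.

    Keeps the majority rows with the smallest mean distance to their
    k nearest minority rows. `ratio` is the desired number of majority
    rows per minority row after resampling (an assumption of this sketch).
    """
    minority_rows = [x for x, y in zip(features, labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    if not minority_rows:
        raise ValueError("need at least one minority-class row")
    k = min(k, len(minority_rows))

    def mean_dist_to_nearest_minority(x):
        # Euclidean distances to all minority rows, averaged over the k closest.
        dists = sorted(math.dist(x, m) for m in minority_rows)
        return sum(dists[:k]) / k

    n_keep = min(len(majority_idx), int(round(ratio * len(minority_rows))))
    ranked = sorted(
        majority_idx,
        key=lambda i: mean_dist_to_nearest_minority(features[i]),
    )
    keep = set(ranked[:n_keep])
    keep.update(i for i, y in enumerate(labels) if y == minority)
    kept = sorted(keep)  # preserve the original row order
    return [features[i] for i in kept], [labels[i] for i in kept]


# Hypothetical similarity features (two scores per candidate pair);
# class 1 = match, class 0 = non-match.
features = [
    [0.95, 0.90],
    [0.90, 0.85],
    [0.80, 0.70],
    [0.70, 0.75],
    [0.10, 0.20],
    [0.05, 0.15],
    [0.20, 0.10],
    [0.15, 0.05],
]
labels = [1, 1, 0, 0, 0, 0, 0, 0]

X_res, y_res = near_miss_1(features, labels, k=2, ratio=1.0)
print(y_res)  # [1, 1, 0, 0]
```

On this toy data, the two non-matches that survive are the ones lying closest to the matches, that is, the hard cases near the decision boundary, while the clearly dissimilar pairs are dropped.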
