Undersampling with NearMiss
Learn how to effectively balance the training data by undersampling the majority class.
Real-world entity resolution tasks are severely imbalanced classification problems, which makes them suboptimal for learning. In smaller datasets, we face ratios of one match per thousands of no-matches, and in medium- to large-scale datasets, the ratio is worse by several orders of magnitude. Applying indexing techniques can reduce the imbalance to some extent.
Let's explore how we can improve by applying undersampling to the following precomputed dataset of similarity features, which also contains the ground truth in the `class` column:
```python
import pandas as pd

df = pd.read_parquet('_scores_filtered.parquet')[['name_score', 'street_score', 'class']].reset_index(drop=True)

print('Class distribution - "1" means "Match":')
print(df['class'].value_counts())
print('\n-> Class imbalance of one match among {:.0f} no-matches.'.format(186114 / 112))
print('\nExample records:')
print(df.head())
```
We have 112 matches and 186114 no-matches in this dataset. That's still moderate compared to many other entity resolution scenarios.
Balancing with minimal information loss
Our dataset contains 1662 no-matches for every single match. Undersampling means we preserve all 112 examples of the minority class while reducing the number of majority-class examples.
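The simplest way to do this is to drop random no-matches until both classes are the same size. As a minimal sketch, assuming the `_scores_filtered.parquet` file from above, random undersampling with imbalanced-learn could look like this:

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_parquet('_scores_filtered.parquet')[['name_score', 'street_score', 'class']]
X, y = df[['name_score', 'street_score']], df['class']

# Randomly drop no-matches until both classes hold 112 examples.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(pd.Series(y_res).value_counts())
```

Random selection balances the sizes, but it discards no-matches blindly, including the informative ones near the matches.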
Size is only one dimension of the problem. We also want to minimize critical information loss while balancing the data. That's the purpose of an undersampling algorithm like NearMiss from the imbalanced-learn package.
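As a rough sketch, again assuming the dataset from above, NearMiss could be applied to our two similarity features like this (using the NearMiss-1 variant; the parameters shown are illustrative defaults):

```python
import pandas as pd
from imblearn.under_sampling import NearMiss

df = pd.read_parquet('_scores_filtered.parquet')[['name_score', 'street_score', 'class']]
X, y = df[['name_score', 'street_score']], df['class']

# NearMiss-1 keeps the majority examples with the smallest average
# distance to their n_neighbors nearest minority examples, i.e., the
# no-matches closest to the matches.
nm = NearMiss(version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
print(pd.Series(y_res).value_counts())  # both classes end up at 112
```

Unlike random undersampling, the retained no-matches concentrate near the matches, preserving the region where the decision boundary has to be learned.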
The illustration below contains eight matches. Due to the monotonic nature of our features, we can expect a decision boundary dividing the feature space into an upper-right match zone vs. the ...