Undersampling with NearMiss
Learn how to effectively balance the training data by undersampling the majority class.
Real-world entity resolution tasks are severely imbalanced classification problems, which makes them suboptimal for learning. In smaller datasets, we face ratios of one match per thousands of no-matches, and in medium- to large-scale datasets, the ratio is worse by several orders of magnitude. Applying indexing techniques can reduce the imbalance to some extent.
Let's explore how we can improve by applying undersampling to the following precomputed dataset of similarity features, which also contains the ground truth in the `class` column:
```python
import pandas as pd

df = pd.read_parquet('_scores_filtered.parquet')[['name_score', 'street_score', 'class']].reset_index(drop=True)

print('Class distribution - "1" means "Match":')
print(df['class'].value_counts())
print('\n-> Class imbalance of one match among {:.0f} no-matches.'.format(186114 / 112))
print('\nExample records:')
print(df.head())
```
We have 112 matches and 186114 no-matches in this dataset. That's still moderate compared to many other entity resolution scenarios.
Balancing with minimal information loss
Our dataset contains 1662 no-matches for every single match. Undersampling means we preserve all 112 examples of the minority class while reducing the number of majority-class examples.
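The simplest way to do this is to drop random no-matches until both classes are the same size. As a minimal sketch, assuming the `_scores_filtered.parquet` file from above, random undersampling with imbalanced-learn could look like this:

```python
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_parquet('_scores_filtered.parquet')[['name_score', 'street_score', 'class']]
X, y = df[['name_score', 'street_score']], df['class']

# Randomly drop no-matches until both classes hold 112 examples.
rus = RandomUnderSampler(random_state=42)
X_res, y_res = rus.fit_resample(X, y)
print(pd.Series(y_res).value_counts())
```

Random selection balances the sizes, but it discards no-matches blindly, including the informative ones near the matches.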
Size is only one dimension of the problem. We also want to minimize critical information loss while balancing the data. That's the purpose of an undersampling algorithm like NearMiss from the imbalanced-learn package.
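As a rough sketch, again assuming the dataset from above, NearMiss could be applied to our two similarity features like this (using the NearMiss-1 variant; the parameters shown are illustrative defaults):

```python
import pandas as pd
from imblearn.under_sampling import NearMiss

df = pd.read_parquet('_scores_filtered.parquet')[['name_score', 'street_score', 'class']]
X, y = df[['name_score', 'street_score']], df['class']

# NearMiss-1 keeps the majority examples with the smallest average
# distance to their n_neighbors nearest minority examples, i.e., the
# no-matches closest to the matches.
nm = NearMiss(version=1, n_neighbors=3)
X_res, y_res = nm.fit_resample(X, y)
print(pd.Series(y_res).value_counts())  # both classes end up at 112
```

Unlike random undersampling, the retained no-matches concentrate near the matches, preserving the region where the decision boundary has to be learned.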
The illustration below contains eight matches. Due to the monotonic nature of our features, we can expect a decision boundary dividing the feature space into an upper-right match zone vs. the ...