Binary Classification in Entity Resolution

Get an overview of binary classification in entity resolution.

We'll cover the following...

Class imbalance and performance evaluation
Rule-based vs. learning-based models
Trading between precision and recall
Labeling
Key takeaway

For every pair of records, we must decide whether both records refer to the same real-world entity. That’s a binary classification problem with the classes “match” and “no-match.” However, a typical real-world entity resolution task differs from textbook classification examples in several ways:

The number of pairs is huge and grows quadratically with the number of records, although most pairs are trivial to classify.
The classes are heavily imbalanced, typically with fewer than 0.1% of pairs being actual matches.
Very few labels are available, if any.
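To make the imbalance concrete, here is a minimal sketch of what happens when a classifier simply predicts “no-match” for every pair. The pair count and the helper name `accuracy_of_all_no_match` are illustrative assumptions, and the 0.1% match rate is taken as the upper end of the range mentioned above.

```python
def accuracy_of_all_no_match(n_pairs: int, match_rate: float) -> float:
    """Accuracy of a trivial model that labels every pair as 'no-match'."""
    n_matches = round(n_pairs * match_rate)
    # The trivial model is right on every true non-match and wrong on every match.
    return (n_pairs - n_matches) / n_pairs


# Assumed, illustrative numbers: one million pairs, 0.1% of them true matches.
n_pairs = 1_000_000
match_rate = 0.001

accuracy = accuracy_of_all_no_match(n_pairs, match_rate)
recall = 0.0  # the trivial model never finds a single match

print(f"accuracy: {accuracy:.3%}, recall: {recall:.0%}")
```

Under these assumptions, the trivial model looks nearly perfect in terms of accuracy while finding no matches at all, which is why accuracy alone tells us little in this setting.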

Let’s discuss some challenges and opportunities when dealing with binary classification for entity resolution.

Class imbalance and performance evaluation

Let $r_1,\ldots,r_n$ denote our sample of size $n$, where each record $r_i$ may itself be an array of atomic attributes. Deduplication means we must classify every pair $p_{ij}=(r_i,r_j)$ with $1\leq i<j\leq n$ as either $c_{ij}=1$ (match) or $c_{ij}=0$ (no-match). That’s $k=n\cdot(n-1)/2$ ...
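The pair count $k=n\cdot(n-1)/2$ can be checked with a short sketch. The record names and the helper `count_pairs` are hypothetical and serve only as an illustration; note that the code uses 0-based indices, whereas the text uses 1-based ones.

```python
from itertools import combinations


def count_pairs(n: int) -> int:
    """Number of unordered pairs (i, j) with i < j among n records."""
    return n * (n - 1) // 2


# Hypothetical toy sample of four records.
records = ["r1", "r2", "r3", "r4"]

# Enumerate all index pairs (i, j) with i < j.
pairs = list(combinations(range(len(records)), 2))
assert len(pairs) == count_pairs(len(records))

# The number of candidate pairs grows quadratically with the sample size.
for n in (10, 1_000, 100_000):
    print(f"n = {n:>7,}  ->  k = {count_pairs(n):>13,}")
```

Enumerating all pairs like this is only feasible for small samples; the loop at the end uses the closed-form count instead of materializing the pairs.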
