Binary Classification in Entity Resolution

Get an overview of binary classification in entity resolution.

We must decide for every pair of records whether they refer to the same real-world entity. That’s a binary classification problem with the classes “match” and “no-match.” However, a typical real-world entity resolution task differs from standard textbook classification examples for several reasons:

  • A huge number of pairs, growing quadratically with the number of records. Most of them are trivial to classify.

  • A heavy class imbalance, typically with less than 0.1% actual matches.

  • Very few available labels (if any).
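To make the first two points concrete, here is a minimal Python sketch. The names `pair_count`, `match_share`, and the toy mapping `entity_of` are hypothetical and introduced only for illustration. The share computation assumes, purely for the sake of the example, that every record has exactly one duplicate; real datasets will differ.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of unordered record pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def match_share(n: int, n_matches: int) -> float:
    """Fraction of all pairs that are true matches."""
    return n_matches / pair_count(n)


# Hypothetical toy sample: record index -> true entity.
# The labels are assumed known here only to illustrate the imbalance;
# in practice they are rarely available.
entity_of = {0: "a", 1: "a", 2: "b", 3: "c", 4: "c", 5: "d"}

# Enumerate every unordered pair of record indices exactly once.
pairs = list(combinations(entity_of, 2))
matches = [(i, j) for i, j in pairs if entity_of[i] == entity_of[j]]
print(f"{len(pairs)} pairs, {len(matches)} matches")

# Quadratic growth of the number of pairs, and the match share under the
# illustrative assumption of one duplicate per record (n / 2 matching pairs).
for n in (1_000, 10_000, 100_000):
    print(n, pair_count(n), f"{match_share(n, n // 2):.5%}")
```

Under this assumption the number of matches grows only linearly with the sample size while the number of pairs grows quadratically, so the share of matches shrinks as the sample grows.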

Let’s discuss some challenges and opportunities when dealing with binary classification for entity resolution.

Class imbalance and performance evaluation

Let $r_1,\ldots,r_n$ denote our sample of size $n$, where every record $r_i$ can be a whole array of atomic record attributes. Deduplication means we must classify each pair $p_{ij}=(r_i,r_j)$ into $c_{ij}=1$ (match) or $c_{ij}=0$ (no-match) for every $1\leq i<j\leq n$. That’s $k=n\cdot(n-1)/2$ ...