Binary Classification in Entity Resolution

We must decide for every pair of records whether they refer to the same real-world entity. That’s a binary classification problem with the classes “match” and “no-match.” However, a typical real-world entity resolution task differs from standard textbook classification examples in several ways:

  • A huge number of pairs, growing quadratically with the number of records. Most of them are trivial to classify.

  • A heavy class imbalance, typically with less than 0.1% actual matches.

  • Very few available labels (if any).
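
To make the first two points concrete, here is a minimal sketch of how quickly the pair count grows and why plain accuracy is misleading under heavy imbalance. The sample sizes and the 0.1% match rate are hypothetical illustration values, and `baseline_metrics` is a helper name introduced only for this example.

```python
def count_pairs(n: int) -> int:
    """Number of unordered record pairs for a sample of n records."""
    return n * (n - 1) // 2


def baseline_metrics(num_pairs: int, num_matches: int) -> dict:
    """Accuracy and recall of a classifier that labels every pair 'no-match'."""
    true_negatives = num_pairs - num_matches  # every non-match is classified correctly
    accuracy = true_negatives / num_pairs
    recall = 0.0  # no true match is ever found
    return {"accuracy": accuracy, "recall": recall}


# Quadratic growth: ten times more records means roughly a hundred times more pairs.
for n in (100, 1_000, 10_000):
    print(n, count_pairs(n))

# Hypothetical scenario: 1,000 records, about 0.1% of all pairs are matches.
num_pairs = count_pairs(1_000)
num_matches = num_pairs // 1_000
metrics = baseline_metrics(num_pairs, num_matches)
print(metrics)
```

Under these assumptions, the trivial classifier is correct on more than 99.9% of all pairs while finding no matches at all, which is why accuracy alone says little about the quality of an entity resolution model.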

Let’s discuss some challenges and opportunities when dealing with binary classification for entity resolution.

Class imbalance and performance evaluation

Let $r_1,\ldots,r_n$ denote our sample of size $n$, where every record $r_i$ can be a whole array of atomic record attributes. Deduplication means we must classify each pair $p_{ij}=(r_i,r_j)$ into $c_{ij}=1$ (match) or $c_{ij}=0$ (no-match) for every $1\leq i<j\leq n$. That’s $k=n\cdot(n-1)/2$ individual classification tasks we need to consider.

The restaurants dataset consists of $n=864$ records, which translates to $k=372816$ pairs. Researchers prepared and open-sourced this data together with the ground truth so that we can evaluate our approaches.
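
The pair count can be checked with a few lines of Python. This is only a sketch: `candidate_pairs` is a hypothetical helper name, and the toy records are placeholders rather than rows from the restaurants dataset.

```python
from itertools import combinations


def count_pairs(n: int) -> int:
    """Number of pairs (i, j) with 1 <= i < j <= n, i.e. k = n * (n - 1) / 2."""
    return n * (n - 1) // 2


def candidate_pairs(records):
    """Return an iterator over index pairs (i, j) with i < j, each pair once."""
    return combinations(range(len(records)), 2)


# Toy sample of four placeholder records (not taken from the restaurants dataset).
toy_records = ["rec_a", "rec_b", "rec_c", "rec_d"]
toy_pairs = list(candidate_pairs(toy_records))

# Enumerating the pairs agrees with the closed-form count.
assert len(toy_pairs) == count_pairs(len(toy_records))

print(toy_pairs)
print(count_pairs(864))  # 372816
```

Enumerating index pairs instead of record pairs keeps the sketch independent of how a record is represented, so the same loop would work for records stored as tuples, dictionaries, or data frame rows.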
