Motivating Clustering

Understand the importance of clustering in entity resolution.

A typical entity resolution pipeline starts by preprocessing each record individually: $\tilde{r}_1 = C(r_1), \ldots, \tilde{r}_n = C(r_n)$. Next comes pairwise feature engineering, $s_{ij} = F(\tilde{r}_i, \tilde{r}_j)$, followed by pairwise matching, $c_{ij} = M(s_{ij})$, where $c_{ij} = 1$ represents a match and $c_{ij} = 0$ a non-match. This is a binary classification problem.
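The three stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `clean`, `features`, and `match` are hypothetical stand-ins for $C$, $F$, and $M$, the records are made-up strings, and a single string-similarity score with a fixed threshold takes the place of a real feature vector and a trained classifier.

```python
from difflib import SequenceMatcher
from itertools import combinations


def clean(record: str) -> str:
    """C: preprocess one record (lowercase, collapse whitespace)."""
    return " ".join(record.lower().split())


def features(a: str, b: str) -> float:
    """F: one similarity score in [0, 1] for a pair of cleaned records."""
    return SequenceMatcher(None, a, b).ratio()


def match(score: float, threshold: float = 0.8) -> int:
    """M: 1 for a predicted match, 0 otherwise (threshold is an assumption)."""
    return int(score >= threshold)


# Made-up records for illustration only.
records = ["John  Smith", "john smith", "Jane Doe"]

# Step 1: preprocess each record individually.
cleaned = [clean(r) for r in records]

# Steps 2 and 3: score every pair, then classify it as match / non-match.
predictions = {
    (i, j): match(features(cleaned[i], cleaned[j]))
    for i, j in combinations(range(len(cleaned)), 2)
}

print(predictions)
```

Each pair is classified on its own here, which is exactly the limitation that the collective approach addresses.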

Collective entity resolution goes beyond pairs and draws on the collective evidence of any number of records. The goal is to improve classification accuracy and to resolve conflicts that would otherwise make the output impractical. A typical conflict arises when the pairwise model matches record A with B and B with C, but not A with C.

Clusters

Let’s reformulate our resolution task as a clustering problem on graphs. Starting from our pairwise predictions, we create a graph where nodes represent the records $r_1, \ldots, r_n$ and an edge between $r_i$ and $r_j$ means that our model predicts a match, $c_{ij} = 1$. Here is an example:
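As a minimal sketch of this construction, the code below builds the match graph from a set of hypothetical pairwise predictions and then reads clusters off it as connected components. Taking connected components is an assumption made here for illustration, as one of the simplest ways to turn the graph into clusters. The prediction values and the helper names `build_graph` and `connected_components` are likewise invented for the example.

```python
# Hypothetical pairwise predictions c_ij for five records r_1..r_5.
# Pairs that are not listed are treated as non-matches.
predictions = {(1, 2): 1, (2, 3): 1, (1, 3): 0, (3, 4): 0, (4, 5): 1}


def build_graph(n, predictions):
    """Nodes are record ids 1..n; an edge i-j exists when c_ij == 1."""
    graph = {i: set() for i in range(1, n + 1)}
    for (i, j), c in predictions.items():
        if c == 1:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph):
    """Group nodes that are linked by a chain of predicted matches."""
    seen = set()
    clusters = []
    for start in graph:
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


graph = build_graph(5, predictions)
clusters = connected_components(graph)
print(clusters)  # [[1, 2, 3], [4, 5]]
```

Records 1 and 3 land in the same cluster even though the pairwise model predicted $c_{13} = 0$, because both are linked to record 2. Connected components settle such conflicts by always trusting the chain of matches, and other clustering methods may settle them differently.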
