Motivating Clustering

Understand the importance of clustering in entity resolution.

A typical entity resolution pipeline starts with preprocessing records $\tilde{r}_1=C(r_1),\ldots,\tilde{r}_n=C(r_n)$ individually. Next comes pairwise feature engineering $s_{ij}=F(\tilde{r}_i,\tilde{r}_j)$, followed by pairwise matching $c_{ij}=M(s_{ij})$, where $c=1$ represents a match and $c=0$ otherwise, which makes this a binary classification problem.
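The three stages can be sketched in a few lines of Python. This is only an illustration: the functions `clean`, `features`, and `match` stand in for $C$, $F$, and $M$, and their bodies (lowercasing, exact-match flags, a 0.5 threshold) as well as the sample records are assumptions, not the pipeline used in this course.

```python
from itertools import combinations


def clean(record):
    """C: preprocess one record (assumed rule: strip whitespace, lowercase)."""
    return {key: value.strip().lower() for key, value in record.items()}


def features(a, b):
    """F: pairwise features s_ij (assumed: one exact-match flag per field)."""
    return [float(a[key] == b[key]) for key in sorted(a)]


def match(s, threshold=0.5):
    """M: binary decision c_ij (assumed rule: mean feature value above threshold)."""
    return int(sum(s) / len(s) > threshold)


# Hypothetical toy records, invented for this sketch.
records = [
    {"name": "Ann Lee ", "city": "Berlin", "email": "ann@x.org"},
    {"name": "ann lee", "city": "berlin", "email": "a.lee@x.org"},
    {"name": "Bob Ray", "city": "Paris", "email": "bob@y.org"},
]

cleaned = [clean(r) for r in records]

# One prediction per unordered pair (i, j) with i < j.
predictions = {
    (i, j): match(features(cleaned[i], cleaned[j]))
    for i, j in combinations(range(len(cleaned)), 2)
}
print(predictions)  # {(0, 1): 1, (0, 2): 0, (1, 2): 0}
```

Note that every decision here looks at exactly two records at a time; nothing ties the decision for one pair to the decisions for the others.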

Collective entity resolution goes beyond pairs: it draws on the collective evidence of any number of records to improve outcomes. The aim is to improve classification accuracy and to resolve potential conflicts that would otherwise make the output impractical.
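To make "conflict" concrete, consider one common kind, assumed here for illustration: a transitivity violation, where records A and B match, B and C match, but A and C do not. The sketch below scans pairwise predictions for such triples. The function name `find_conflicts` and the input format (a dictionary keyed by index pairs `(i, j)` with `i < j`) are hypothetical choices for this example.

```python
from itertools import combinations


def find_conflicts(n, predictions):
    """Return triples (i, j, k) in which exactly two of the three pairs match.

    Two matches plus one non-match is inconsistent: the two matches imply
    that all three records refer to the same entity.
    """
    def c(a, b):
        # Predictions are stored once per unordered pair, smaller index first.
        return predictions[(min(a, b), max(a, b))]

    conflicts = []
    for i, j, k in combinations(range(n), 3):
        if c(i, j) + c(j, k) + c(i, k) == 2:
            conflicts.append((i, j, k))
    return conflicts


# Hypothetical pairwise output: 0~1 and 1~2 match, but 0~2 does not.
predictions = {(0, 1): 1, (1, 2): 1, (0, 2): 0}
print(find_conflicts(3, predictions))  # [(0, 1, 2)]
```

A pairwise matcher alone has no way to settle such a triple; deciding whether the three records form one entity or several requires looking at them together.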

Clusters

Let’s reformulate our resolution task as a clustering problem on graphs. Starting from our pairwise predictions, we create a graph where nodes represent records $r_1,\ldots,r_n$ ...