Motivating Clustering

Understand the importance of clustering in entity resolution.

A typical entity resolution pipeline starts with preprocessing records $\tilde{r}_1=C(r_1),\ldots,\tilde{r}_n=C(r_n)$ individually. Next comes pairwise feature engineering $s_{ij}=F(\tilde{r}_i,\tilde{r}_j)$, followed by pairwise matching $c_{ij}=M(s_{ij})$, where $c_{ij}=1$ represents a match and $c_{ij}=0$ a non-match, which makes this a binary classification problem.
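The three stages above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. The function names `clean`, `features`, and `match`, the sample records, the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions chosen for the example. A real pipeline would typically replace the threshold rule with a trained classifier.

```python
from difflib import SequenceMatcher
from itertools import combinations


def clean(record):
    """C: normalize one record on its own (lowercase, collapse whitespace)."""
    return {k: " ".join(str(v).lower().split()) for k, v in record.items()}


def features(a, b):
    """F: one string-similarity score in [0, 1] per field shared by both records."""
    return {k: SequenceMatcher(None, a[k], b[k]).ratio() for k in a.keys() & b.keys()}


def match(s, threshold=0.8):
    """M: return 1 if the mean similarity reaches the (assumed) threshold, else 0."""
    if not s:  # no shared fields, so there is no evidence of a match
        return 0
    return int(sum(s.values()) / len(s) >= threshold)


# Hypothetical sample records, for illustration only.
records = [
    {"name": "Jon  Smith", "city": "Boston"},
    {"name": "jon smith", "city": "boston"},
    {"name": "Mary Jones", "city": "Denver"},
]

cleaned = [clean(r) for r in records]

# c_ij = M(F(r~_i, r~_j)) for every unordered pair i < j
decisions = {
    (i, j): match(features(cleaned[i], cleaned[j]))
    for i, j in combinations(range(len(cleaned)), 2)
}

for pair, c in decisions.items():
    print(pair, c)
```

Note that each pair is decided in isolation here: nothing forces the decisions for pairs (i, j), (j, k), and (i, k) to agree with one another.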

Collective entity resolution goes beyond pairs to improve outcomes ...