Transitive Clustering

Become familiar with transitive clustering and its shortcomings.

Transitive clustering is the mother of all clustering algorithms for entity resolution. Its logic is appealing, it is very fast to compute, and it benefits recall. However, these advantages typically come at the cost of a significant reduction in precision. Let’s begin by examining how the algorithm works on a small dataset.

Connected components

Clustering follows pairwise prediction. Let $r_1,\ldots,r_n$ denote the original records, and let $c_{ij}=1$ if the pairwise model predicts a match between $r_i$ and $r_j$, and $c_{ij}=0$ otherwise. The following graph illustrates a toy example with $n=9$: nodes represent records, and an edge between two nodes represents a predicted match, while the absence of an edge represents a predicted no-match.
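To make the idea concrete, here is a minimal sketch of how predicted matches can be turned into clusters by finding the connected components of the match graph. The function name `connected_components` and the edge list `matches` are hypothetical: the edges are invented for illustration and are not the ones in the lesson's graph. Records are indexed from 0, so index `k` stands for record $r_{k+1}$.

```python
from collections import defaultdict


def connected_components(n, matches):
    """Group records 0..n-1 into clusters, given pairs (i, j) with c_ij = 1."""
    parent = list(range(n))  # every record starts as its own cluster

    def find(x):
        # Walk up to the representative of x, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every predicted match.
    for i, j in matches:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i

    # Collect the records that share a representative.
    clusters = defaultdict(list)
    for record in range(n):
        clusters[find(record)].append(record)
    return sorted(clusters.values())


# Hypothetical predicted matches for n = 9 records (assumed, not from the lesson).
matches = [(0, 1), (1, 2), (3, 4), (4, 5), (5, 6)]
clusters = connected_components(9, matches)
print(clusters)  # [[0, 1, 2], [3, 4, 5, 6], [7], [8]]
```

In this assumed example, records 3 and 6 land in the same cluster even though the pairwise model never predicted a direct match between them. They are linked only through the chain 3, 4, 5, 6, which is the kind of behavior that helps recall but can hurt precision.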
