Merge Center Clustering (MCC)
Learn how to apply (merge) center clustering and how it relates to transitive clustering.
We'll cover the following...
Merge center clustering (MCC) is a sweet alternative to the omnipresent transitive clustering. Both algorithms are low in computational cost, which is necessary in most entity resolution scenarios. However, according to several scientific studies, MCC is superior to transitive clustering regarding resolution quality.
We will demonstrate its workings with a toy dataset of eleven records from the following code cell. Every line represents a match prediction from the pairwise model. An absent combination means it did not make it through the indexing step, or the pairwise model predicts a no-match for that pair.
import pandas as pdscores = pd.DataFrame([# top cluster['r4', 'r3', 0.87],['r8', 'r2', 0.89],['r3', 'r1', 0.99],['r7', 'r5', 0.85],['r3', 'r0', 0.84],['r5', 'r0', 0.82],['r4', 'r1', 0.86],['r10', 'r8', 0.82],['r3', 'r2', 0.81],['r9', 'r8', 0.84],['r12', 'r8', 0.81],['r11', 'r8', 0.85],['r2', 'r1', 0.85],['r7', 'r6', 0.80],['r6', 'r5', 0.83]], columns=['id_1', 'id_2', 'score'])print(scores)
The following figure illustrates the output as a graph, where nodes represent records and the presence/absence of edges represent match predictions.
In real-world scenarios, we will face graphs consisting of thousands of connected components, where one might look like our toy example. The clustering algorithms below only alter the shape within a connected component but never across two.
Note: Two graph nodes are termed as connected if a path exists from one node to the other. A connected component is a maximal subgraph, where all nodes within the subgraph are connected.
Transitive clustering assigns the whole connected component to a single cluster. Let’s explore how MCC works.
Center clustering (CC)
MCC is an extension of center clustering (CC). The latter proceeds by the following steps:
Sort the pairs by their similarity (
score
) in descending order.Every record will be a center or a no-center node. We initialize empty lists
for center nodes, ...