...

/

Merge Center Clustering (MCC)

Merge Center Clustering (MCC)

Learn how to apply (merge) center clustering and how it relates to transitive clustering.

Merge center clustering (MCC) is a sweet alternative to the omnipresent transitive clustering. Both algorithms are low in computational cost, which is necessary in most entity resolution scenarios. However, according to several scientific studies, MCC is superior to transitive clustering regarding resolution quality.

We will demonstrate its workings with a toy dataset of eleven records from the following code cell. Every line represents a match prediction from the pairwise model. An absent combination means it did not make it through the indexing step, or the pairwise model predicts a no-match for that pair.

Press + to interact
import pandas as pd
scores = pd.DataFrame([
# top cluster
['r4', 'r3', 0.87],
['r8', 'r2', 0.89],
['r3', 'r1', 0.99],
['r7', 'r5', 0.85],
['r3', 'r0', 0.84],
['r5', 'r0', 0.82],
['r4', 'r1', 0.86],
['r10', 'r8', 0.82],
['r3', 'r2', 0.81],
['r9', 'r8', 0.84],
['r12', 'r8', 0.81],
['r11', 'r8', 0.85],
['r2', 'r1', 0.85],
['r7', 'r6', 0.80],
['r6', 'r5', 0.83]
], columns=['id_1', 'id_2', 'score'])
print(scores)

The following figure illustrates the output as a graph, where nodes represent records and the presence/absence of edges represent match predictions.

Press + to interact
Nodes represent records and edges represent match predictions
Nodes represent records and edges represent match predictions

In real-world scenarios, we will face graphs consisting of thousands of connected components, where one might look like our toy example. The clustering algorithms below only alter the shape within a connected component but never across two.

Note: Two graph nodes are termed as connected if a path exists from one node to the other. A connected component is a maximal subgraph, where all nodes within the subgraph are connected.

Transitive clustering assigns the whole connected component to a single cluster. Let’s explore how MCC works.

Center clustering (CC)

MCC is an extension of center clustering (CC). The latter proceeds by the following steps:

  1. Sort the pairs by their similarity (score) in descending order.

  2. Every record will be a center or a no-center node. We initialize empty lists CC for center nodes, N ...