Deduplication vs. Linkage

There are many synonyms for entity resolution—record linkage, data matching, fuzzy matching, and deduplication to name a few. Isn’t deduplication different from the rest?

Research papers and many implementations distinguish between deduplication within a single dataset and linkage across two or more datasets. In practice, these are two perspectives on the same problem: every multi-dataset linkage task can be reshaped into single-dataset deduplication, and vice versa.

Example: Reshaping one task into another

The typical enterprise has multiple sources recording customer data, such as ERP and CRM systems. Too often, no join key exists between the two customer tables.

import pandas as pd
erp_customers = pd.DataFrame([
    [0, 'ABC Corp.', '123 Oak Ave', 'Montreal QC', 'CA'],
    [1, 'Contoso', 'Rue de Bell.', '75374 Paris', 'FR'],
    [2, 'DB Schenker', 'Kruppstr. 4', '45128 Essen', 'DE'],
    [3, 'Big Kahuna', '99 Ham Drive', 'Miami 33185', 'US'],
    [4, 'Schenker AG', 'Krupp Straße 4', 'Essen-City', 'DE']
], columns=['Id', 'CompanyName', 'AddressLine1', 'AddressLine2', 'Country']).set_index('Id')
crm_accounts = pd.DataFrame([
    ['a3b6l', 'Entity Hero', 'Bolzweg', '47839', 'Krefeld', 'DE'],
    ['ak44j', 'Luha Libre Ole', 'C. de Benova', '66667', 'Palencia', 'ES'],
    ['a89ci', 'Deutsche Bahn', 'Kruppstr. 4', '45128', 'Essen', 'DE'],
    ['aa341', 'Contoso Inc.', 'P.O. Box 123', '75374', 'Paris', 'FR'],
    ['a31bc', 'Bambini Pub', '66 Bobcat Lane', '09876', 'London', 'UK']
], columns=['AccId', 'Name', 'Street', 'ZipCode', 'City', 'Country']).set_index('AccId')

import recordlinkage as rl
# Linkage formulation:
indexer = rl.Index()
indexer.full()
candidate_pairs = indexer.index(erp_customers, crm_accounts)
comparer = rl.Compare()
comparer.string('CompanyName', 'Name', method='jarowinkler', label='name_similarity_score')
# Add more similarity functions here
scores = comparer.compute(candidate_pairs, erp_customers, crm_accounts)
print('Linkage scores:')
print(scores.head())
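# Deduplication formulation:
# all_records is assumed here to be the CRM accounts reshaped to the ERP
# schema and stacked under the ERP customers. IDs are cast to strings so
# the combined index has a single type.
crm_as_erp = pd.DataFrame({
    'CompanyName': crm_accounts['Name'],
    'AddressLine1': crm_accounts['Street'],
    'AddressLine2': crm_accounts['ZipCode'] + ' ' + crm_accounts['City'],
    'Country': crm_accounts['Country'],
})
all_records = pd.concat([erp_customers, crm_as_erp])
all_records.index = all_records.index.astype(str).rename('Id')
indexer = rl.Index()
indexer.full()
candidate_pairs = indexer.index(all_records)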
comparer = rl.Compare()
comparer.string('CompanyName', 'CompanyName', method='jarowinkler', label='name_similarity_score')
# Add more similarity functions here
scores = comparer.compute(candidate_pairs, all_records)
print('Deduplication scores:')
print(scores.head())
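
To make the difference between the two formulations concrete, the following sketch enumerates the candidate pairs of each with the standard library only. The ID lists mirror the example tables (ERP IDs written as strings); the `source` lookup and the variable names are illustrative assumptions and not part of the recordlinkage package.

```python
from itertools import combinations, product

erp_ids = ['0', '1', '2', '3', '4']
crm_ids = ['a3b6l', 'ak44j', 'a89ci', 'aa341', 'a31bc']

# Linkage: every ERP record is paired with every CRM record.
linkage_pairs = {frozenset(p) for p in product(erp_ids, crm_ids)}

# Deduplication: every unordered pair within the combined set.
all_ids = erp_ids + crm_ids
dedup_pairs = {frozenset(p) for p in combinations(all_ids, 2)}

# Hypothetical lookup from record ID to its source system.
source = {**{i: 'erp' for i in erp_ids}, **{i: 'crm' for i in crm_ids}}

# Keeping only cross-source pairs recovers exactly the linkage pairs.
cross_source_pairs = {
    p for p in dedup_pairs if len({source[i] for i in p}) == 2
}

print(len(linkage_pairs))                      # 25
print(len(dedup_pairs))                        # 45
print(cross_source_pairs == linkage_pairs)     # True
print(frozenset({'2', '4'}) in linkage_pairs)  # False
```

The deduplication formulation covers 20 additional within-source pairs, such as ERP records 2 and 4, which the linkage formulation never compares.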

Collective entity resolution techniques

Think of two crime scenes. The inspector finds footprints at the first and hair samples at the second. Neither scene alone delivers enough evidence to resolve the case. However, both crime scenes follow very similar patterns, and the inspector identifies a single suspect after combining the evidence from both.

Similar effects occur when we apply collective entity resolution techniques. Resolving all records as a whole can detect more duplicates than deduplicating each source in isolation first and linking second. This only works with techniques that go beyond pairwise comparisons. For example, let’s assume that ERP record 2 matches CRM record a89ci and ERP record 2 matches ERP record 4 with overwhelming confidence, but 4 with a89ci does not. In the second (deduplication) formulation, we can still indirectly derive that 4 and a89ci match. In the first (linkage) formulation, we cannot, because 2 and 4 are never compared there.

The technique we apply in the example is called transitive clustering. It is a naive approach, which can lead to many wrong conclusions in practice. There are many alternative collective resolution techniques, each with pros and cons deserving a thorough evaluation before deployment.
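
As a minimal sketch of transitive clustering, the snippet below treats each confident pairwise match as an edge and groups records into connected components with a small union-find. The matched pairs are the hypothetical ones assumed in the text above, not output of the scoring code, and `transitive_clusters` is an illustrative helper rather than a recordlinkage function.

```python
def transitive_clusters(matches):
    """Group record IDs into clusters: if a~b and b~c, then a, b, c share a cluster."""
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the clusters of every matched pair.
    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect the members of each root.
    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


# Linkage formulation: 2 and 4 are never compared, so only one match exists.
linkage_matches = [('2', 'a89ci')]
# Deduplication formulation: the pair (2, 4) is also compared and matches.
dedup_matches = [('2', 'a89ci'), ('2', '4')]

print(transitive_clusters(linkage_matches))
print(transitive_clusters(dedup_matches))
```

With the linkage matches, record 4 stays outside the cluster of 2 and a89ci. With the deduplication matches, all three records end up in one cluster even though 4 and a89ci did not match directly. The same mechanism explains the risk: a single false match merges two otherwise unrelated clusters.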

Key takeaway

Subtle differences exist in how we execute entity resolution steps. Still, linkage and deduplication can be considered synonyms when the prior knowledge is the same and incorporated into the index accordingly.
