Deduplication vs. Linkage
Understand the subtle differences between the perspectives one can take on an entity resolution task.
There are many synonyms for entity resolution—record linkage, data matching, fuzzy matching, and deduplication to name a few. Isn’t deduplication different from the rest?
Research papers and many implementations distinguish between deduplication of a single dataset and linkage across two or more datasets. Practically, these are different perspectives on the same problem. We can reshape every multi-dataset linkage task into single-dataset deduplication and vice versa.
Example: Reshaping one task into another
The typical enterprise has multiple sources recording customer data, such as ERP and CRM systems. Too often, no join key exists between the two customer tables.
import pandas as pd

# ERP customer table, indexed by its numeric Id
erp_customers = pd.DataFrame(
    [
        [0, 'ABC Corp.', '123 Oak Ave', 'Montreal QC', 'CA'],
        [1, 'Contoso', 'Rue de Bell.', '75374 Paris', 'FR'],
        [2, 'DB Schenker', 'Kruppstr. 4', '45128 Essen', 'DE'],
        [3, 'Big Kahuna', '99 Ham Drive', 'Miami 33185', 'US'],
        [4, 'Schenker AG', 'Krupp Straße 4', 'Essen-City', 'DE'],
    ],
    columns=['Id', 'CompanyName', 'AddressLine1', 'AddressLine2', 'Country'],
).set_index('Id')

# CRM account table, indexed by its alphanumeric AccId
crm_accounts = pd.DataFrame(
    [
        ['a3b6l', 'Entity Hero', 'Bolzweg', '47839', 'Krefeld', 'DE'],
        ['ak44j', 'Luha Libre Ole', 'C. de Benova', '66667', 'Palencia', 'ES'],
        ['a89ci', 'Deutsche Bahn', 'Kruppstr. 4', '45128', 'Essen', 'DE'],
        ['aa341', 'Contoso Inc.', 'P.O. Box 123', '75374', 'Paris', 'FR'],
        ['a31bc', 'Bambini Pub', '66 Bobcat Lane', '09876', 'London', 'UK'],
    ],
    columns=['AccId', 'Name', 'Street', 'ZipCode', 'City', 'Country'],
).set_index('AccId')
Linking two DataFrames means we compare records from the first dataset with records from the second without deduplicating each in isolation. Assume we can confidently match 2 with a89ci and a89ci with 4. We can then reason that 2 and 4 are duplicates within the ERP. More on this idea in the following sections.
In the alternative matching approach, we reshape the second table so that its format matches the first to join both vertically.
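# Reshape the CRM table to the ERP layout
crm_reshaped = (
    crm_accounts
    .reset_index()
    .assign(
        Source='crm',
        # A callable reads the columns from the re-indexed frame. Passing
        # crm_accounts['ZipCode'] + ... directly would align on the old AccId
        # index and fill AddressLine2 with NaN.
        AddressLine2=lambda df: df['ZipCode'] + ' ' + df['City'],
    )
    .rename(columns={'AccId': 'Id', 'Name': 'CompanyName', 'Street': 'AddressLine1'})
)

# Stack both tables vertically and keep only the shared columns
all_records = (
    pd.concat(
        [erp_customers.reset_index().assign(Source='erp'), crm_reshaped],
        ignore_index=True,
    )
    .filter(items=['Source', 'Id', 'CompanyName', 'AddressLine1', 'AddressLine2', 'Country'])
    .set_index('Id')
)
print(all_records)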
Now, deduplicating means we compare pairs within this single dataset, regardless of whether both records come from the same original dataset or from different ones. For example, we directly compare records 2 and 4.
The recordlinkage package can execute both formulations of this task. At the very least, it is a two-step procedure: first, select candidate pairs of records, and second, compute the similarity for each pair. Here’s an example for both formulations, comparing the names only:
import recordlinkage as rl

# Linkage formulation:
indexer = rl.Index()
indexer.full()
candidate_pairs = indexer.index(erp_customers, crm_accounts)
comparer = rl.Compare()
comparer.string('CompanyName', 'Name', method='jarowinkler', label='name_similarity_score')
# Add more similarity functions here
scores = comparer.compute(candidate_pairs, erp_customers, crm_accounts)
print('Linkage scores:')
print(scores.head())

# Deduplication formulation:
indexer = rl.Index()
indexer.full()
candidate_pairs = indexer.index(all_records)
comparer = rl.Compare()
comparer.string('CompanyName', 'CompanyName', method='jarowinkler', label='name_similarity_score')
# Add more similarity functions here
scores = comparer.compute(candidate_pairs, all_records)
print('Deduplication scores:')
print(scores.head())
Altering the index
The original linkage task does not cover direct comparisons within each original dataset, but the deduplication task does. By reshaping, we changed the set of pairs we plan to compare, or in short, the index. That looks like an accidental side effect in our example, but there is a lot more behind indexing in general. For example, does it make sense to compare customers from different countries? We can reduce wasted computation with the indexing techniques discussed in a dedicated chapter.
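To make the change in the index concrete, the following sketch counts the candidate pairs of both formulations for the ten example records. It uses only the standard library rather than the recordlinkage indexer, and the `same_country` helper and the tuple layout are assumptions made for illustration. Real indexing techniques avoid generating the discarded pairs in the first place, whereas this sketch simply filters a full index after the fact.

```python
from itertools import combinations, product

# (source, id, country) triples copied from the two example tables
erp = [('erp', 0, 'CA'), ('erp', 1, 'FR'), ('erp', 2, 'DE'),
       ('erp', 3, 'US'), ('erp', 4, 'DE')]
crm = [('crm', 'a3b6l', 'DE'), ('crm', 'ak44j', 'ES'), ('crm', 'a89ci', 'DE'),
       ('crm', 'aa341', 'FR'), ('crm', 'a31bc', 'UK')]

# Full index, linkage formulation: every ERP record against every CRM record
linkage_pairs = list(product(erp, crm))

# Full index, deduplication formulation: every unordered pair of the stacked table
dedup_pairs = list(combinations(erp + crm, 2))


def same_country(pair):
    """Hypothetical filter: keep a pair only if both records share a country."""
    left, right = pair
    return left[2] == right[2]


blocked_linkage = [pair for pair in linkage_pairs if same_country(pair)]
blocked_dedup = [pair for pair in dedup_pairs if same_country(pair)]

print(len(linkage_pairs), len(dedup_pairs))      # 25 45
print(len(blocked_linkage), len(blocked_dedup))  # 5 7
```

The 20 additional pairs in the full deduplication index are exactly the within-source comparisons, such as 2 with 4, that the linkage formulation never makes.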
Collective entity resolution techniques
Think of two crime scenes. The inspector finds footprints at the first and hair samples at the second. Neither scene in isolation delivers enough evidence to resolve the case. However, both crime scenes follow very similar patterns. The inspector identifies a single suspect after combining the evidence from both scenes.
Similar effects happen when we apply collective entity resolution techniques. Resolving as a whole can result in more detected duplicates than deduplicating in isolation first and linking second. This only works with techniques that go beyond pairwise comparisons. For example, let’s assume that 2 with a89ci and 2 with 4 match with overwhelming confidence, but 4 with a89ci does not. We can still indirectly derive that 4 and a89ci match in the second formulation but not in the first. Remember, we have not compared 2 with 4 in the first formulation.
The technique we apply in the example is called transitive clustering. It is a naive approach, which can lead to many wrong conclusions in practice. There are many alternative collective resolution techniques, each with pros and cons deserving a thorough evaluation before deployment.
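A minimal sketch of transitive clustering, assuming the pairwise matches from the example are already known: it groups records with a small union-find structure, so any two records connected through a chain of matches end up in the same cluster. The function name `transitive_clusters` and the `erp:`/`crm:` prefixes on the record Ids are illustrative choices, not part of any library.

```python
def transitive_clusters(matched_pairs):
    """Group records into clusters by following chains of pairwise matches."""
    parent = {}

    def find(record):
        # Walk up to the cluster representative, shortening the path on the way
        parent.setdefault(record, record)
        while parent[record] != record:
            parent[record] = parent[parent[record]]
            record = parent[record]
        return record

    for left, right in matched_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two clusters

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return list(clusters.values())


# Deduplication formulation: 2~a89ci and 2~4 match, 4~a89ci does not
dedup_matches = [('erp:2', 'crm:a89ci'), ('erp:2', 'erp:4')]

# Linkage formulation: 2 and 4 are never compared, so only 2~a89ci is found
linkage_matches = [('erp:2', 'crm:a89ci')]

print(transitive_clusters(dedup_matches))
print(transitive_clusters(linkage_matches))
```

In the deduplication formulation, all three records fall into one cluster even though 4 and a89ci did not match directly. The same mechanism is what makes the approach risky: a single false match is enough to merge two otherwise unrelated clusters.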
Key takeaway
Subtle differences exist in how we execute entity resolution steps. Still, linkage and deduplication can be considered synonyms when the prior knowledge is the same and incorporated into the index accordingly.