Why Preprocessing Is Important

Understand the importance of attribute preprocessing for entity resolution tasks.

Preprocessing is the most impactful step in most entity resolution tasks. If we need to improve matching quality and have only limited time, it is most likely best spent here.

We decide whether two records match by measuring their similarity across attributes with one or more similarity functions. Two interrelated questions motivate the examples in this lesson.

  • What is the quality of the attributes? Can we improve it enough to raise the matching quality significantly?

  • Do the similarity functions measure attributes the way we want? If not, what do we need to take care of?

Preprocessing addresses both points.

Note: Below, we use open data, referred to as “Amazon-Google,” “Abt-Buy,” and “North-Carolina-Voters.” See the Glossary of this course for attribution and references.

Case, punctuation, and special characters

We use the jellyfish Python package to demonstrate the impact of preprocessing on string similarity functions. RecordLinkage also uses jellyfish as the backend for many of its similarity functions.

The three examples below are seemingly innocent variations:

  • Lower case vs. upper case

  • With and without punctuation

  • With and without accents, such as in the German Umlaute ä, ö, and ü

import jellyfish


def damerau_levenshtein_similarity(x: str, y: str) -> float:
    """Convert the Damerau-Levenshtein distance into a similarity on the [0, 1] scale."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # two empty strings are identical
    # Divide the edit distance by the longer string's length, then invert it.
    return 1 - jellyfish.damerau_levenshtein_distance(x, y) / longest


print('Case example: ', damerau_levenshtein_similarity('PAUL', 'paul'))
print('Punctuation example: ', damerau_levenshtein_similarity('p.k.', 'pk'))
print('Special character example: ', damerau_levenshtein_similarity('müller', 'mueller'))

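Most humans would assign a perfect similarity of 1.0 to all three cases. That’s not how Damerau-Levenshtein and most other similarity functions work: they compare strings character by character, so the three pairs score 0.0, 0.5, and roughly 0.71, respectively.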
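To see where the scores in the example above come from, it can help to count the edits by hand. The following is a minimal sketch that uses plain Levenshtein distance written with the standard library only. It is a stand-in for jellyfish's Damerau-Levenshtein, not its implementation; the two agree on these three pairs because none of them involves a transposition of adjacent characters. The function names are chosen for illustration.

```python
def levenshtein_distance(x: str, y: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute), computed row by row."""
    previous = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        current = [i]
        for j, cy in enumerate(y, start=1):
            current.append(min(
                previous[j] + 1,               # delete cx
                current[j - 1] + 1,            # insert cy
                previous[j - 1] + (cx != cy),  # substitute (free if the characters are equal)
            ))
        previous = current
    return previous[-1]


def similarity(x: str, y: str) -> float:
    """Scale the distance by the longer string's length and invert it."""
    longest = max(len(x), len(y))
    return 1.0 if longest == 0 else 1 - levenshtein_distance(x, y) / longest


for a, b in [("PAUL", "paul"), ("p.k.", "pk"), ("müller", "mueller")]:
    print(a, b, levenshtein_distance(a, b), round(similarity(a, b), 2))
# PAUL paul 4 0.0
# p.k. pk 2 0.5
# müller mueller 2 0.71
```

Every upper-case letter counts as a different character from its lower-case counterpart, so 'PAUL' vs. 'paul' needs four substitutions out of four characters. The two periods cost two deletions, and 'ü' vs. 'ue' costs one substitution plus one insertion.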

The cure here is simple: normalize the case, remove punctuation, and either remove accents or, specifically for German Umlaute, replace them by “ae,” ...
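A normalization step along these lines could be sketched as follows, using only the Python standard library. The function name `normalize`, the `transliterate_umlauts` flag, and the `UMLAUT_MAP` table are hypothetical names introduced for illustration. The mapping assumes the common German transliteration convention, and only ASCII punctuation is removed, so treat this as a starting point rather than a complete preprocessing pipeline.

```python
import string
import unicodedata

# Assumption: the common German transliteration convention for Umlaute.
UMLAUT_MAP = str.maketrans({
    "\u00e4": "ae",  # ä
    "\u00f6": "oe",  # ö
    "\u00fc": "ue",  # ü
})
# Translation table that deletes ASCII punctuation characters.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def normalize(text: str, transliterate_umlauts: bool = True) -> str:
    """Lower-case, optionally transliterate Umlaute, strip accents and punctuation."""
    # Compose first so that 'u' + combining diaeresis is treated like 'ü'.
    text = unicodedata.normalize("NFC", text).lower()
    if transliterate_umlauts:
        text = text.translate(UMLAUT_MAP)
    # Strip remaining accents: decompose, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return text.translate(PUNCTUATION_TABLE)


print(normalize("PAUL") == normalize("paul"))       # True
print(normalize("p.k.") == normalize("pk"))         # True
print(normalize("müller") == normalize("mueller"))  # True
print(normalize("müller", transliterate_umlauts=False))  # muller
```

After normalization, each of the three example pairs consists of identical strings, so any string similarity function would assign them a similarity of 1.0. Whether to transliterate Umlaute or simply strip the diacritic depends on how the names are spelled in the data being matched.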