Why Preprocessing Is Important

Preprocessing is often the most impactful step in an entity resolution task. If we have spare time to invest in matching quality, it is most likely best spent here.

We decide whether two records match by measuring their similarity across attributes with one or more similarity functions. Two interrelated questions are worth motivating by example:

  • What is the quality of the attributes? Can we improve it to improve the matching quality significantly?

  • Do the similarity functions measure attributes the way we want? If not, what do we need to take care of?

Preprocessing addresses both questions.
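To make the match decision concrete, here is a minimal sketch of comparing two records attribute by attribute and thresholding the average score. The records, the helper names `similarity` and `is_match`, and the 0.8 threshold are illustrative assumptions, not values from the course data. Python's built-in `difflib.SequenceMatcher` stands in for a dedicated similarity function.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity score between 0.0 and 1.0 for two strings."""
    return SequenceMatcher(None, a, b).ratio()


def is_match(rec_a: dict, rec_b: dict, attributes: list, threshold: float = 0.8) -> bool:
    """Average the per-attribute similarities and compare against a threshold.

    The threshold is a hypothetical choice; in practice it is tuned per dataset.
    """
    scores = [similarity(rec_a[attr], rec_b[attr]) for attr in attributes]
    return sum(scores) / len(scores) >= threshold


# Made-up product records for illustration only.
rec_a = {"name": "Acme Photo Editor 5", "manufacturer": "Acme Software"}
rec_b = {"name": "Acme Photo Editor 5.0", "manufacturer": "Acme Software Inc."}

for attr in ("name", "manufacturer"):
    print(attr, round(similarity(rec_a[attr], rec_b[attr]), 3))
print("match:", is_match(rec_a, rec_b, ["name", "manufacturer"]))
```

Because the decision depends entirely on the per-attribute scores, anything that distorts those scores, such as inconsistent formatting, feeds directly into the match result.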

Note: Below, we use open data, referred to as “Amazon-Google,” “Abt-Buy,” and “North-Carolina-Voters.” See the Glossary of this course for attribution and references.

Case, punctuation, and special characters

We use the jellyfish Python package to demonstrate how preprocessing affects string similarity functions. RecordLinkage also relies on jellyfish as the backend for many of its similarity functions.

The three examples below are seemingly innocent variations:

  • Lower case vs. upper case

  • With and without punctuation

  • With and without accents, such as in the German Umlaute ä, ö, and ü
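The three variations listed above can be reproduced with a short, self-contained sketch. It assumes nothing beyond the standard library: the hand-written `levenshtein` function computes the same edit-distance metric that jellyfish exposes as `levenshtein_distance`, and the `normalize` helper and the example strings are hypothetical, chosen only for illustration.

```python
import string
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text: str) -> str:
    """Lower-case, strip accents and ASCII punctuation, collapse whitespace."""
    text = text.lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


# Made-up string pairs, one per variation.
pairs = [
    ("Apple iPod", "APPLE IPOD"),  # lower case vs. upper case
    ("D.J. Smith", "DJ Smith"),    # with and without punctuation
    ("Müller", "Muller"),          # with and without accents
]

for left, right in pairs:
    raw = levenshtein(left, right)
    cleaned = levenshtein(normalize(left), normalize(right))
    print(f"{left!r} vs. {right!r}: raw distance {raw}, after normalization {cleaned}")
```

Stripping accents is only one possible design choice. For German names, transliterating ü to "ue" is also common, and which option works better depends on how the data was entered.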
