Why Preprocessing Is Important

Understand the importance of attribute preprocessing for entity resolution tasks.

We'll cover the following...

Case, punctuation, and special characters
Generic attribute quality issues

Preprocessing is often the most impactful step in an entity resolution task. If we need to improve matching quality and have some spare time, it is most likely best spent here.

We decide whether two records match by measuring their similarity across attributes with one or more similarity functions. Two interrelated questions follow from this, and we motivate both by example here.

What is the quality of the attributes? Can we improve it to improve the matching quality significantly?
Do the similarity functions measure attributes the way we want? If not, what do we need to take care of?

Preprocessing addresses both points.
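To make the match decision concrete, here is a minimal sketch under stated assumptions. It uses `difflib.SequenceMatcher` from the standard library as a stand-in similarity function (not the measure used later in this lesson), averages the attribute similarities, and compares the mean against a threshold. The records, the `is_match` helper, and the 0.8 threshold are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher


def similarity(x: str, y: str) -> float:
    """Stand-in similarity on a [0, 1] scale (difflib ratio, an assumption)."""
    return SequenceMatcher(None, x, y).ratio()


def is_match(a: dict, b: dict, attributes: list, threshold: float = 0.8) -> bool:
    """Declares a match if the mean attribute similarity reaches the threshold."""
    scores = [similarity(a[attr], b[attr]) for attr in attributes]
    return sum(scores) / len(scores) >= threshold


# Hypothetical records, not taken from the course datasets.
record_a = {'name': 'paul mueller', 'city': 'berlin'}
record_b = {'name': 'paul muller', 'city': 'berlin'}
record_c = {'name': 'anna schmidt', 'city': 'hamburg'}

print(is_match(record_a, record_b, ['name', 'city']))
print(is_match(record_a, record_c, ['name', 'city']))
```

Both the attribute quality and the choice of similarity function feed directly into the scores, which is why the two questions above matter for the final decision.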

Note: Below, we use open data, referred to as “Amazon-Google,” “Abt-Buy,” and “North-Carolina-Voters.” See the Glossary of this course for attribution and references.

Case, punctuation, and special characters

We use the jellyfish Python package to demonstrate the impact of preprocessing on string similarity functions. RecordLinkage also uses jellyfish as the backend for many of its similarity functions.

The three examples below are seemingly innocent variations:

Lower case vs. upper case
With and without punctuation
With and without special characters, such as the German umlauts ä, ö, and ü


import jellyfish


def damerau_levenshtein_similarity(x: str, y: str) -> float:
    """Converts Damerau-Levenshtein distance into similarity on a [0, 1] scale."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0  # Two empty strings are identical; avoids division by zero.
    return 1 - jellyfish.damerau_levenshtein_distance(x, y) / longest


# Each pair differs only in case, punctuation, or a special character.
print('Case example: ', damerau_levenshtein_similarity('PAUL', 'paul'))
print('Punctuation example: ', damerau_levenshtein_similarity('p.k.', 'pk'))
print('Special character example: ', damerau_levenshtein_similarity('müller', 'mueller'))
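Working the three pairs above out by hand, the edit distances are 4, 2, and 2, so the similarities are 0.0, 0.5, and roughly 0.71, even though each pair plausibly refers to the same value. The sketch below shows one way preprocessing could remove these differences before comparison. It uses only the standard library; the `normalize` helper and the German transliteration table are assumptions for illustration, and other languages or datasets would need different rules.

```python
import string

# Hypothetical German transliteration table; other languages need other rules.
UMLAUT_MAP = str.maketrans({'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss'})
# Deletes every ASCII punctuation character.
PUNCTUATION_MAP = str.maketrans('', '', string.punctuation)


def normalize(text: str) -> str:
    """Lowercases, transliterates German umlauts, and strips ASCII punctuation."""
    text = text.lower()
    text = text.translate(UMLAUT_MAP)
    return text.translate(PUNCTUATION_MAP)


pairs = [('PAUL', 'paul'), ('p.k.', 'pk'), ('müller', 'mueller')]
for left, right in pairs:
    # After normalization, each pair compares as equal.
    print(left, right, normalize(left) == normalize(right))
```

Once both strings of a pair are normalized to the same value, any string similarity function scores them as identical. Whether every rule is appropriate depends on the data: stripping punctuation, for example, may be harmful for attributes where it carries meaning.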
