...

/

Edit- and Substring-Based Similarity

Edit- and Substring-Based Similarity

Get an overview of edit- and substring-based similarity functions for texts.

As humans, we understand that “Robert Schwarz,” “Rob Shwarts,” “Bob Shvarts,” and “Schwaz, Robert” are suspiciously similar. Can we also compute scores programmatically that represent our human perception?

Let’s explore several similarity functions for texts based on edit distances or common substrings. A third class of text similarities based on vectorization is out of scope here. We use the following toy dataset here:

Press + to interact
import pandas as pd
df = pd.DataFrame({'name': ['Robert Schwarz', 'Rob Shvarts', 'Bob Shavrts', 'Schwarz, Bob'],
'postcode': ['12345', '21345', '123', '--345']})
print(df)

With this dataset, we illustrate the characteristics of the different similarity functions and how to use them with the recordlinkage API.

Overview of string similarity functions

The large variety of string similarity functions can be overwhelming. Below, we give a brief introduction of a shortlist:

  • Levenshtein: Levenshtein counts the number of insertions, deletions, and substitutions to convert one text into another.

  • Damerau-Levenshtein: Damerau-Levenshtein works like Levenshtein but also allows transpositions of adjacent characters—for example, a single transposition instead of two substitutions results in a higher similarity between “abcd” and “abdc.”

  • Jaro: Jaro starts by counting the number of ...