Edit- and Substring-Based Similarity

Get an overview of edit- and substring-based similarity functions for texts.

As humans, we understand that “Robert Schwarz,” “Rob Shwarts,” “Bob Shvarts,” and “Schwaz, Robert” are suspiciously similar. Can we also compute scores programmatically that represent our human perception?

Let’s explore several similarity functions for texts based on edit distances or common substrings. A third class of text similarities, based on vectorization, is out of scope here. We use the following toy dataset:

import pandas as pd

df = pd.DataFrame({'name': ['Robert Schwarz', 'Rob Shvarts', 'Bob Shavrts', 'Schwarz, Bob'],
                   'postcode': ['12345', '21345', '123', '--345']})
print(df)

With this dataset, we illustrate the characteristics of the different similarity functions and how to use them with the recordlinkage API.
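Before looking at the individual functions, here is a minimal sketch of what such a comparison looks like with the recordlinkage API, reusing the df defined above. The method names (for example, 'levenshtein' and 'jaro') are taken from the recordlinkage documentation; exact scores may vary slightly between library versions.

import recordlinkage

# Build all candidate record pairs within the toy dataset (a full index is
# fine for four rows; use blocking for realistically sized data).
indexer = recordlinkage.Index()
indexer.full()
pairs = indexer.index(df)

# Score the 'name' column of each pair with two of the similarity functions
# introduced below; the scores are similarities, typically in the range [0, 1].
compare = recordlinkage.Compare()
compare.string('name', 'name', method='levenshtein', label='name_levenshtein')
compare.string('name', 'name', method='jaro', label='name_jaro')
features = compare.compute(pairs, df)
print(features)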

Overview of string similarity functions

The large variety of string similarity functions can be overwhelming. Below, we give a brief introduction to a shortlist:

  • Levenshtein: Levenshtein counts the minimum number of single-character insertions, deletions, and substitutions required to convert one text into another.

  • Damerau-Levenshtein: Damerau-Levenshtein works like Levenshtein but also allows transpositions of adjacent characters—for example, a single transposition instead of two substitutions results in a higher similarity between “abcd” and “abdc.”

  • Jaro: Jaro starts by counting the number of single-character matches $m$. A character from one text matches a character from the other if the two are equal and their positions are close, at most roughly half the length of the longer text apart. Next, we count the number of transpositions $t$ required to bring the sequence of matching characters in the first text into the order of the second. The final score is $s_{\text{jaro}} = \frac{1}{3}\left(\frac{m}{\mathrm{len}(\text{text}_1)} + \frac{m}{\mathrm{len}(\text{text}_2)} + \frac{m - t}{m}\right)$. A worked sketch of this computation follows below.
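To make the formula concrete, here is a minimal from-scratch sketch of the Jaro score. It assumes the standard matching window of half the longer length minus one and counts transpositions as half the number of out-of-order matches; it is an illustration, not the recordlinkage implementation.

def jaro(s1: str, s2: str) -> float:
    """Plain-Python sketch of the Jaro similarity described above."""
    if not s1 or not s2:
        return 0.0
    # Characters may only match within this positional window.
    window = max(0, max(len(s1), len(s2)) // 2 - 1)
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)

    # m: number of single-character matches within the window.
    m = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # t: transpositions, i.e., half the matched characters that are out of order.
    seq1 = [c for c, ok in zip(s1, matched1) if ok]
    seq2 = [c for c, ok in zip(s2, matched2) if ok]
    t = sum(a != b for a, b in zip(seq1, seq2)) / 2

    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

print(jaro('abcd', 'abdc'))              # one transposition: (1 + 1 + 3/4) / 3 ≈ 0.9167
print(jaro('Robert Schwarz', 'Rob Shvarts'))  # compare two of the toy names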