...

Embedding-Based Similarity

Learn how to measure the similarity of long, unstructured texts with embeddings and cosine similarity.

Most entity resolution examples use edit-based similarity functions. These work well for short texts such as names, addresses, and phone numbers, but they are usually a poor fit for long, unstructured texts, such as product descriptions in e-commerce.

Note: The Abt-Buy dataset we use in this lesson is open data. See the Glossary of this course for attribution and references.

import pandas as pd
import jellyfish
import textdistance
import numpy as np

def damerau_levenshtein_similarity(x: str, y: str):
    """Converts Damerau-Levenshtein distance into a similarity on the [0, 1] scale."""
    return 1 - jellyfish.damerau_levenshtein_distance(x, y) / max(len(x), len(y))

abt = pd.read_csv('abt_buy/abt.csv', encoding='iso-8859-1')
buy = pd.read_csv('abt_buy/buy.csv', encoding='iso-8859-1')
xref = pd.read_csv('abt_buy/perfect_mapping.csv')

# Pick one known matching Abt-Buy pair from the gold-standard mapping:
pair = xref.sample(1, random_state=123)

# Lowercase and remove nonalphanumeric characters:
abt_description = abt.loc[abt.id.isin(pair.idAbt), 'description'].str.lower().str.replace('[^a-z0-9 ]', '', regex=True).values[0]
buy_description = buy.loc[buy.id.isin(pair.idBuy), 'description'].str.lower().str.replace('[^a-z0-9 ]', '', regex=True).values[0]

print(abt_description)
print('--- vs ---')
print(buy_description)
print('----------')

# The descriptions are already lowercased, so no further .lower() is needed:
print('Damerau-Levenshtein:', damerau_levenshtein_similarity(abt_description, buy_description))
print('Jaro-Winkler:', jellyfish.jaro_winkler_similarity(abt_description, buy_description))
print('Ratcliff-Obershelp:', textdistance.ratcliff_obershelp.similarity(abt_description, buy_description))

As humans, we have little doubt that both descriptions refer to the same product. In both texts, we spot characteristic tokens like “nikon,” “d60,” “1855mm,” and “f3556g.” On the other hand, we understand that frequent words like “with,” “and,” and “product” are irrelevant. Damerau-Levenshtein, Jaro-Winkler, and Ratcliff-Obershelp do not distinguish between relevant and irrelevant tokens.
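To see this concretely, here is a small sketch of our own (it is not part of the lesson's code, and it assumes the snippet above has already run so that abt_description and buy_description exist). A plain token-level view immediately surfaces the distinctive words that the character-based measures dilute.

# Sketch: compare the two descriptions token by token.
abt_tokens = set(abt_description.split())
buy_tokens = set(buy_description.split())

# Tokens appearing in both descriptions include the distinctive ones
# ('nikon', 'd60', ...) alongside filler words ('with', 'and', ...):
print('Shared tokens:', sorted(abt_tokens & buy_tokens))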

Text embeddings

Text embeddings, or vectorization, is a broad family of techniques for transforming any text $x$ into a vector of numeric values $T(x)=(y_1,\ldots,y_d)=\textbf{y}$. Techniques range from simple word counts $y_i$ to highly sophisticated encodings of meaning from large language models (LLMs) trained on massive datasets.
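As a minimal illustration of the word-count end of that spectrum, the sketch below (again our own, using scikit-learn's CountVectorizer, which the lesson itself has not introduced) turns the two cleaned descriptions into count vectors over a shared vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

# Sketch: the simplest embedding, a bag-of-words count vector.
# Assumes abt_description and buy_description from the snippet above.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform([abt_description, buy_description])

y1, y2 = counts.toarray()  # one count vector per description
print('Vocabulary size (dimension d):', len(vectorizer.get_feature_names_out()))
print('First 10 counts of y1:', y1[:10])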

Instead of comparing two texts $x_1$ and $x_2$ by their characters, we transform them first into numeric vectors $\textbf{y}_1$ and $\textbf{y}_2$ ...
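The lesson's subtitle names cosine similarity as the comparison for these vectors. As a hedged sketch of what that comparison looks like, the following computes $\cos(\theta) = \textbf{y}_1 \cdot \textbf{y}_2 \,/\, (\lVert\textbf{y}_1\rVert \, \lVert\textbf{y}_2\rVert)$ on the count vectors from the sketch above; it is our own illustration, not the lesson's code.

# Sketch: cosine similarity between the two count vectors y1 and y2
# produced by the CountVectorizer example above.
cosine = np.dot(y1, y2) / (np.linalg.norm(y1) * np.linalg.norm(y2))
print('Cosine similarity:', cosine)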