Embedding-Based Similarity
Learn how to measure the similarity of long, unstructured texts with embeddings and cosine similarity.
Most entity resolution examples use edit-based similarity functions. These work well for short texts, such as names, addresses, and phone numbers, but they are usually a poor fit for long, unstructured texts, for example product descriptions in e-commerce.
Note: The Abt-Buy dataset we use in this lesson is open data. See the Glossary of this course for attribution and references.
import pandas as pd
import jellyfish
import textdistance
import numpy as np

def damerau_levenshtein_similarity(x: str, y: str):
    """Converts Damerau-Levenshtein distance into similarity on a [0, 1] scale."""
    return 1 - jellyfish.damerau_levenshtein_distance(x, y) / np.max([len(x), len(y)])

# Load the Abt and Buy product catalogs and the ground-truth mapping:
abt = pd.read_csv('abt_buy/abt.csv', encoding='iso-8859-1')
buy = pd.read_csv('abt_buy/buy.csv', encoding='iso-8859-1')
xref = pd.read_csv('abt_buy/perfect_mapping.csv')

# Sample one known matching pair:
pair = xref.sample(1, random_state=123)

# Lower case and remove nonalphanumerics:
abt_description = abt.loc[abt.id.isin(pair.idAbt), 'description'].str.lower().str.replace('[^a-z0-9 ]', '', regex=True).values[0]
buy_description = buy.loc[buy.id.isin(pair.idBuy), 'description'].str.lower().str.replace('[^a-z0-9 ]', '', regex=True).values[0]

print(abt_description)
print('--- vs ---')
print(buy_description)
print('----------')
print('Damerau-Levenshtein: ', damerau_levenshtein_similarity(abt_description, buy_description))
print('Jaro-Winkler: ', jellyfish.jaro_winkler_similarity(abt_description, buy_description))
print('Ratcliff-Obershelp: ', textdistance.ratcliff_obershelp.similarity(abt_description, buy_description))
As humans, we have little doubt that both descriptions refer to the same product. In both texts, we spot characteristic tokens like “nikon,” “d60,” “1855mm,” and “f3556g.” On the other hand, we understand that frequent words like “with,” “and,” and “product” are irrelevant. Damerau-Levenshtein, Jaro-Winkler, and Ratcliff-Obershelp make no such distinction: every character contributes equally to the score, whether it comes from a decisive token or a filler word.
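To see this concretely, consider a hypothetical pair of descriptions that differ only in one characteristic token (the strings below are made up for illustration, not taken from the dataset):

import jellyfish

# Hypothetical descriptions: identical except for the brand token.
a = 'nikon d60 digital slr camera with 1855mm lens'
b = 'canon d60 digital slr camera with 1855mm lens'

# Almost every character matches, so the score stays high even though
# the two texts describe different products.
print(jellyfish.jaro_winkler_similarity(a, b))

An edit-based score barely reacts to the swapped brand, because a single distinctive token accounts for only a handful of characters.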
Text embeddings
Text embedding, or vectorization, is a broad family of techniques for transforming any text into a vector of numeric values. Techniques range from simple word counts to highly sophisticated encodings of meaning produced by large language models (LLMs) trained on massive datasets.
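The simplest embedding of this kind is a vector of word counts. As a quick sketch, assuming scikit-learn's CountVectorizer (which this lesson has not introduced and which is only one of many possible tools):

from sklearn.feature_extraction.text import CountVectorizer

# Each text becomes a vector of token counts over a shared vocabulary.
vectorizer = CountVectorizer()
vectors = vectorizer.fit_transform([
    'nikon d60 camera with 1855mm lens',
    'canon d60 camera with lens and more',
])
print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(vectors.toarray())                   # one count vector per text

Each dimension of the resulting vectors corresponds to one vocabulary token, so texts that share many tokens end up with similar vectors.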
Instead of comparing two texts x and y by their characters, we first transform them into numeric vectors and then measure how similar those vectors are, for example with cosine similarity.
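As a minimal sketch of that idea, we can embed the two product descriptions from above with TF-IDF and compare the resulting vectors (again assuming scikit-learn; TF-IDF is just one possible choice of embedding):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Fit a TF-IDF vectorizer on the two descriptions and embed them.
tfidf = TfidfVectorizer()
vectors = tfidf.fit_transform([abt_description, buy_description])

# Cosine similarity = dot(v1, v2) / (||v1|| * ||v2||), which lies on a
# [0, 1] scale for nonnegative TF-IDF vectors.
print('Cosine similarity:', cosine_similarity(vectors[0], vectors[1])[0, 0])

Unlike the edit-based scores, this comparison operates on tokens rather than characters, and the TF-IDF weighting down-weights words that appear everywhere while emphasizing the distinctive ones.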