...

Embedding-Based Similarity

Learn how to measure the similarity of long, unstructured texts with embeddings and cosine similarity.

Most entity resolution examples use edit-based similarity functions. These work well for short texts such as names, addresses, and phone numbers, but they are usually a poor fit for long, unstructured texts, such as product descriptions in e-commerce.

Note: The Abt-Buy dataset we use in this lesson is open data. See the Glossary of this course for attribution and references.

import pandas as pd
import jellyfish
import textdistance
import numpy as np

def damerau_levenshtein_similarity(x: str, y: str):
    """Converts Damerau-Levenshtein distance into a similarity on the [0, 1] scale."""
    return 1 - jellyfish.damerau_levenshtein_distance(x, y) / max(len(x), len(y))

abt = pd.read_csv('abt_buy/abt.csv', encoding='iso-8859-1')
buy = pd.read_csv('abt_buy/buy.csv', encoding='iso-8859-1')
xref = pd.read_csv('abt_buy/perfect_mapping.csv')

# Pick one known matching Abt-Buy pair from the gold-standard mapping:
pair = xref.sample(1, random_state=123)

# Lowercase and remove nonalphanumeric characters:
abt_description = abt.loc[abt.id.isin(pair.idAbt), 'description'].str.lower().str.replace('[^a-z0-9 ]', '', regex=True).values[0]
buy_description = buy.loc[buy.id.isin(pair.idBuy), 'description'].str.lower().str.replace('[^a-z0-9 ]', '', regex=True).values[0]

print(abt_description)
print('--- vs ---')
print(buy_description)
print('----------')

# The descriptions are already lowercased, so no further .lower() is needed:
print('Damerau-Levenshtein:', damerau_levenshtein_similarity(abt_description, buy_description))
print('Jaro-Winkler:', jellyfish.jaro_winkler_similarity(abt_description, buy_description))
print('Ratcliff-Obershelp:', textdistance.ratcliff_obershelp.similarity(abt_description, buy_description))

As humans, we have little doubt that both descriptions refer to the same product. In both texts, we spot characteristic tokens like “nikon,” “d60,” “1855mm,” and “f3556g.” On the other hand, we understand that frequent words like “with,” “and,” and “product” are irrelevant. Damerau-Levenshtein, Jaro-Winkler, and Ratcliff-Obershelp do not distinguish between relevant and irrelevant tokens.
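To see this concretely, here is a small sketch of our own (it is not part of the lesson's code, and it assumes the snippet above has already run so that abt_description and buy_description exist). A plain token-level view immediately surfaces the distinctive words that the character-based measures dilute.

# Sketch: compare the two descriptions token by token.
abt_tokens = set(abt_description.split())
buy_tokens = set(buy_description.split())

# Tokens appearing in both descriptions include the distinctive ones
# ('nikon', 'd60', ...) alongside filler words ('with', 'and', ...):
print('Shared tokens:', sorted(abt_tokens & buy_tokens))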

Text embeddings

Text embeddings, or vectorization, is a broad family of techniques for transforming any text $x$ into a vector of numeric values $T(x)=(y_1,\ldots,y_d)=\textbf{y}$. Techniques range from simple word counts $y_i$ to highly sophisticated encodings of meaning from large language models (LLMs) trained on massive datasets.
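As a minimal illustration of the word-count end of that spectrum, the sketch below (again our own, using scikit-learn's CountVectorizer, which the lesson itself has not introduced) turns the two cleaned descriptions into count vectors over a shared vocabulary.

from sklearn.feature_extraction.text import CountVectorizer

# Sketch: the simplest embedding, a bag-of-words count vector.
# Assumes abt_description and buy_description from the snippet above.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform([abt_description, buy_description])

y1, y2 = counts.toarray()  # one count vector per description
print('Vocabulary size (dimension d):', len(vectorizer.get_feature_names_out()))
print('First 10 counts of y1:', y1[:10])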

Instead of comparing two texts $x_1$ and $x_2$ by their characters, we transform them first into numeric vectors $\textbf{y}_1$ and $\textbf{y}_2$ ...
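The lesson's subtitle names cosine similarity as the comparison for these vectors. As a hedged sketch of what that comparison looks like, the following computes $\cos(\theta) = \textbf{y}_1 \cdot \textbf{y}_2 \,/\, (\lVert\textbf{y}_1\rVert \, \lVert\textbf{y}_2\rVert)$ on the count vectors from the sketch above; it is our own illustration, not the lesson's code.

# Sketch: cosine similarity between the two count vectors y1 and y2
# produced by the CountVectorizer example above.
cosine = np.dot(y1, y2) / (np.linalg.norm(y1) * np.linalg.norm(y2))
print('Cosine similarity:', cosine)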