Introduce Similarity

Explore how to engineer similarity features for entity resolution by comparing record pairs from multiple angles. Understand how to transform distance measures into similarity scores and apply efficient methods for indexing and scoring. This lesson helps you develop a strong foundation in similarity feature engineering to improve binary classification of record matches.

Entity resolution is about identifying records that belong to the same real-world entity. We compare candidate pairs of records and decide, for each pair, whether it is a match or a no-match. In other words, we have to solve a binary classification problem.

Features for binary classification

Let’s introduce feature engineering in the context of entity resolution. Let $p_{ij}=(r_i,r_j)$ denote a candidate pair we want to classify into match vs. no-match. We base our classification decision not on the raw records themselves but on similarity features derived from each pair.
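To make the idea of pair-derived features concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the lesson's own implementation: the record attributes (`name`, `city`, `year`), the helper names, the use of `difflib.SequenceMatcher` as the string comparator, and the `1 / (1 + d)` transform from distance to similarity are all hypothetical choices made for this example.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (1.0 means identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def distance_to_similarity(d: float) -> float:
    """Map a non-negative distance to a similarity in (0, 1].

    One possible transform among many: distance 0 gives similarity 1,
    and the similarity decays toward 0 as the distance grows.
    """
    return 1.0 / (1.0 + d)


def similarity_features(r_i: dict, r_j: dict) -> list[float]:
    """Compare one candidate pair from several angles, one feature per angle.

    The attribute names below are assumptions made for this sketch.
    """
    return [
        string_similarity(r_i["name"], r_j["name"]),              # fuzzy name match
        1.0 if r_i["city"] == r_j["city"] else 0.0,               # exact city match
        distance_to_similarity(abs(r_i["year"] - r_j["year"])),   # numeric closeness
    ]


# Two hypothetical records that plausibly describe the same company.
r_i = {"name": "Acme Corp.", "city": "Berlin", "year": 1999}
r_j = {"name": "ACME Corp", "city": "Berlin", "year": 2000}

features = similarity_features(r_i, r_j)
print(features)
```

Each entry of `features` is a number between 0 and 1, so a classifier receives a fixed-length numeric vector per pair regardless of how messy the raw attribute values are.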

We feed the model with vectors of numeric values $T(p_{ij})=\mathbf s_{ij}=(s_{ij,1},\ldots,s_{ij,d})$, where every component $s_{ij,k}$ measures the similarity of the two records $r_i$ ...