Introduce Similarity

Understand the concept of similarity and its relation to distance, which is fundamental to entity resolution.

Entity resolution is about identifying records that belong to the same real-world entity. We compare candidate pairs of records and decide, for each pair, whether it is a match or a no-match. In other words, we have to solve a binary classification problem.

Features for binary classification

Let’s introduce feature engineering in the context of entity resolution. Let $p_{ij}=(r_i,r_j)$ denote a candidate pair we want to classify into match vs. no-match. We base our classification decision not on the raw records themselves but on similarity features derived from each pair.

We feed the model with vectors of numeric values $T(p_{ij})=\mathbf s_{ij}=(s_{ij,1},\ldots,s_{ij,d})$, where every component $s_{ij,k}$ measures the similarity of the two records $r_i$ and $r_j$ from a different angle. It is up to us to decide how to engineer this vector of similarities. Some examples are given below:
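As one illustration, here is a minimal Python sketch of such a mapping $T$. It is not the lesson's own implementation. The record fields `name`, `city`, and `year`, the helper function names, and the `scale` parameter are all assumptions made for this example. Each component scores the pair from a different angle: character-level string similarity, token overlap, exact agreement, and a numeric distance converted into a similarity.

```python
from difflib import SequenceMatcher


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def jaccard_similarity(a: str, b: str) -> float:
    """Overlap of the two whitespace-token sets, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def numeric_similarity(x: float, y: float, scale: float) -> float:
    """Turn the distance |x - y| into a similarity in (0, 1]."""
    return 1.0 / (1.0 + abs(x - y) / scale)


def similarity_vector(r_i: dict, r_j: dict) -> list[float]:
    """Hypothetical T(p_ij): one similarity per angle of comparison."""
    return [
        string_similarity(r_i["name"], r_j["name"]),
        jaccard_similarity(r_i["name"], r_j["name"]),
        1.0 if r_i["city"] == r_j["city"] else 0.0,  # exact agreement
        numeric_similarity(r_i["year"], r_j["year"], scale=5.0),
    ]
```

Calling `similarity_vector` on two records returns a list of $d=4$ numbers, which is the kind of vector a binary classifier could take as input. The last component also hints at the link between similarity and distance: the larger the distance between the two values, the smaller the similarity.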
