Introduce Similarity

Understand the concept of similarity and its relation to distance, both of which are fundamental to entity resolution.

Entity resolution is about identifying records that belong to the same real-world entity. We compare candidate pairs of records and decide, for each pair, whether it is a match or a no-match. In other words, we have to solve a binary classification problem.

Features for binary classification

Let’s introduce feature engineering in the context of entity resolution. Let $p_{ij}=(r_i,r_j)$ denote a candidate pair we want to classify into match vs. no-match. We base our classification decision not on the raw records themselves but on similarity features derived from each pair.
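As a rough illustration of deriving similarity features from a pair, the sketch below maps two records to a short list of numbers between 0 and 1. Everything specific in it is an assumption made for the example: the record fields (`name`, `city`, `year`), the use of Python's `difflib.SequenceMatcher` as the string similarity, and the `max_gap` parameter of the numeric similarity. It is one possible choice of features, not a prescribed one.

```python
from difflib import SequenceMatcher
from typing import List


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] based on difflib's ratio; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def year_similarity(a: int, b: int, max_gap: int = 10) -> float:
    """Linear decay: 1.0 for equal years, 0.0 once the gap reaches max_gap (assumed)."""
    return max(0.0, 1.0 - abs(a - b) / max_gap)


def similarity_features(r_i: dict, r_j: dict) -> List[float]:
    """Map a candidate pair (r_i, r_j) to a fixed-length vector of similarities."""
    return [
        string_similarity(r_i["name"], r_j["name"]),
        string_similarity(r_i["city"], r_j["city"]),
        year_similarity(r_i["year"], r_j["year"]),
    ]


# Hypothetical records that differ only by a small spelling variation in the name.
r_i = {"name": "Jon Smith", "city": "Berlin", "year": 1985}
r_j = {"name": "John Smith", "city": "Berlin", "year": 1985}

features = similarity_features(r_i, r_j)
print(features)
```

The classifier would then see only `features`, not the raw records, so the same model can score any pair whose records share these attributes.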

We feed the model with vectors of numeric values $T(p_{ij})=\mathbf{s}_{ij}=(s_{ij,1},\ldots,s_{ij,d})$, where every component $s_{ij,k}$ measures the similarity of the two records $r_i$ ...