An Introduction to Entity Resolution in Python

Introduce Similarity

Understand the concept of similarity and its relation to distance, which is fundamental to entity resolution.

We'll cover the following...

- Features for binary classification
- Different perspectives matter
- Distance to similarity
- Computational cost
- Key takeaway

Entity resolution is about identifying records that belong to the same real-world entity. We compare candidate pairs of records and decide, for each pair, whether it is a match or a no-match. In other words, we have to solve a binary classification problem.
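To make the framing concrete, here is a minimal sketch that enumerates candidate pairs and assigns each one a binary label. The records, the `normalize` helper, and the `is_match` rule are hypothetical placeholders chosen for illustration; a real matcher would typically be a trained classifier rather than an exact comparison.

```python
from itertools import combinations

# Toy records (made up for illustration).
records = [
    {"id": 1, "name": "ACME Corp."},
    {"id": 2, "name": "acme corp"},
    {"id": 3, "name": "Globex Inc."},
]


def normalize(name):
    """Lowercase the name and keep only letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def is_match(r_i, r_j):
    """Placeholder decision rule: match if the normalized names are equal."""
    return normalize(r_i["name"]) == normalize(r_j["name"])


# Every unordered pair of distinct records is a candidate pair.
candidate_pairs = list(combinations(records, 2))

# One binary decision (match / no-match) per candidate pair.
labels = {
    (r_i["id"], r_j["id"]): is_match(r_i, r_j)
    for r_i, r_j in candidate_pairs
}

for pair_ids, matched in labels.items():
    print(pair_ids, "match" if matched else "no-match")
```

With n records there are n(n-1)/2 unordered pairs, so this exhaustive enumeration is only practical for small inputs.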

Features for binary classification

Let’s introduce feature engineering in the context of entity resolution. Let $p_{ij}=(r_i,r_j)$ denote a candidate pair we want to classify into match vs. no-match. We base our classification decision not on the raw records themselves but on similarity features derived from each pair.

We feed the model vectors of numeric values $T(p_{ij})=\mathbf s_{ij}=(s_{ij,1},\ldots,s_{ij,d})$, where every component $s_{ij,k}$ measures the similarity of the two records $r_i$ ...
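The mapping from a candidate pair to a similarity vector can be sketched as follows. This is only an illustration under stated assumptions: the attribute names (`name`, `brand`, `price`), the example records, the choice of `difflib.SequenceMatcher` as the string similarity, and the linear price similarity with its `scale` parameter are all hypothetical and not taken from the lesson.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Similarity in [0, 1] from difflib's matching-subsequence ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def numeric_similarity(x, y, scale):
    """Linearly decaying similarity: 1 when equal, 0 once |x - y| >= scale."""
    return max(0.0, 1.0 - abs(x - y) / scale)


def similarity_vector(r_i, r_j):
    """Map a candidate pair p_ij = (r_i, r_j) to s_ij = (s_ij1, ..., s_ijd).

    Here d = 3, with one component per compared attribute (assumed schema).
    """
    return (
        string_similarity(r_i["name"], r_j["name"]),
        string_similarity(r_i["brand"], r_j["brand"]),
        numeric_similarity(r_i["price"], r_j["price"], scale=100.0),
    )


# Two made-up product records forming one candidate pair.
r_i = {"name": "Apple iPhone 13 128GB", "brand": "Apple", "price": 799.0}
r_j = {"name": "iPhone 13 (128 GB)", "brand": "apple", "price": 789.0}

s_ij = similarity_vector(r_i, r_j)
print(s_ij)
```

Each component lies in $[0, 1]$, so the classifier sees a fixed-length numeric vector per pair regardless of how the raw records are formatted.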
