K-Nearest Embeddings Blocking

Become familiar with vector databases and their use for efficient index blocking.

Indexing in entity resolution helps address the main computational challenge, which is the large number of potential candidate pairs. Traditional methods rely on the following:

  • Lexical matching (also known as exact matching), such as SB

  • Lexical sorting, like in SN

Both methods work within the original feature space. Let’s explore how to measure similarity in a vector space and how to complement lexical by semantic search.

Concept

Let r1,,rnr_1,\ldots,r_n​ denote our dataset of size nn we aim to deduplicate. Indexing is about selecting match candidates from all possible pairs pij=(ri,rj)p_{ij}=(r_i,r_j), preserving most matches while removing many no-matches. The indexing algorithm must be very efficient because the number of all pairs grows quadratically with nn.

In other words, an indexer is a cheap matcher optimized for high recall, while precision is irrelevant. The indexers we discuss here have costs that are linear in nn (cheap enough). The figure below illustrates the construction:

Get hands-on with 1400+ tech skills courses.