Significance of Indexing

Understand how indexing helps reduce costs at the expense of precision and recall.

The matching quality is not the only factor of practical relevance. Think of it this way: we could hire an army of educated people to manually review each pair of records. That would likely push precision (the accuracy among all match predictions), recall (the percentage of actual matches predicted as matches), and our costs through the roof.

Conversely, we could group records by exactly matching all relevant attributes after simple preprocessing (lowercasing, special character removal), something we can implement very cheaply with a few lines of pandas or SQL. That approach costs almost nothing, but it misses every pair that differs by a typo or a formatting variation, so recall suffers. In practice, we must make a trade-off between matching quality and costs.
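To illustrate the cheap end of the spectrum, here is a minimal pandas sketch of exact matching after simple preprocessing. The sample records, the column names `name` and `city`, and the helper `normalize` are assumptions made for this example, not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical customer records; names and columns are illustrative only.
records = pd.DataFrame({
    "name": ["Ann-Marie Lee", "ann marie lee", "Bob Stone", "Bob Stohne"],
    "city": ["Berlin", "BERLIN", "Paris", "Paris"],
})

def normalize(series: pd.Series) -> pd.Series:
    """Lowercase the values and drop everything except letters and digits."""
    return series.str.lower().str.replace(r"[^a-z0-9]", "", regex=True)

# Build one exact-match key from all relevant attributes.
key = normalize(records["name"]) + "|" + normalize(records["city"])

# Records with identical keys receive the same group id.
records["group_id"] = pd.factorize(key)[0]

print(records)
```

The first two rows end up in the same group because they differ only in case and punctuation. The last two rows stay apart because of a single extra letter, which is exactly the kind of match this cheap approach cannot find.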

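Let’s understand the trade-off between costs and recall. We reduce the number of candidate pairs before we proceed with costly comparisons. Every actual match that we filter out at this stage is lost for good, which is how lower costs come at the expense of recall.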

For example, we use a customer record’s country as the filter criterion and select only other records from the same country for the comparison. We can implement this type of search efficiently by precomputing a mapping that returns a list of record indexes for any country value that exists in our database. This kind of mapping is called an inverted index, and creating one is called indexing.
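The country filter described above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the sample records and the helper names `build_index` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical customer records; values are illustrative only.
records = [
    {"name": "Ann Lee", "country": "DE"},
    {"name": "Anne Lee", "country": "DE"},
    {"name": "Bob Stone", "country": "FR"},
    {"name": "A. Lee", "country": "DE"},
    {"name": "Bob Stohne", "country": "FR"},
]

def build_index(records, key):
    """Map each existing value of `key` to the positions of records holding it."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[key]].append(position)
    return dict(index)

def candidate_pairs(index):
    """Yield only pairs of record positions that share the same index value."""
    for positions in index.values():
        yield from combinations(positions, 2)

country_index = build_index(records, "country")
pairs = list(candidate_pairs(country_index))

# Number of pairs a brute-force comparison of all records would produce.
all_pairs = len(records) * (len(records) - 1) // 2

print(country_index)
print(len(pairs), "candidate pairs instead of", all_pairs)
```

With these five records, the index leaves 4 candidate pairs out of the 10 that a full pairwise comparison would need. The price is that two records of the same customer stored with different country values never meet, so that match is missed.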

Math behind the brute-force approach

Let $r_1, r_2, ..., r_n$ ...