Similarity Features

Become familiar with the RecordLinkage API for engineering similarity features.

A RecordLinkage workflow has two main steps:

  1. Indexing: Select which pairs of records are duplicate candidates and therefore should be compared.
  2. Scoring: Configure and compute a vector of similarity functions for every pair in the index.
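To make the two steps concrete before we touch the real API, here is a minimal pure-Python sketch. It does not use recordlinkage at all: the toy `records` dictionary is hypothetical, and `difflib.SequenceMatcher` is only a stand-in for the similarity functions configured later in this lesson.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records keyed by an ID (not the restaurants data).
records = {
    1: {"name": "golden dragon", "city": "boston"},
    2: {"name": "goldn dragon", "city": "boston"},
    3: {"name": "blue lagoon", "city": "austin"},
}
attributes = ("name", "city")

# Step 1, indexing: decide which pairs to compare (here: every possible pair).
candidate_pairs = list(combinations(sorted(records), 2))


def similarity(a, b):
    """Stand-in similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


# Step 2, scoring: one similarity score per attribute for every candidate pair.
features = {
    (i, j): [similarity(records[i][attr], records[j][attr]) for attr in attributes]
    for i, j in candidate_pairs
}

for pair, scores in features.items():
    print(pair, [round(score, 2) for score in scores])
```

Each pair ends up with a small vector of scores, one per attribute, which is the same shape of result the RecordLinkage scoring step produces.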

All-in indexing

We keep it simple here and add every possible pair to the index—a “full” index in the RecordLinkage terminology.

import pandas as pd
import recordlinkage as rl
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv')
indexer = rl.Index()
indexer.full() # add all possible pairs to the index
candidate_links = indexer.index(restaurants.set_index('customer_id'))
print(candidate_links[:3])

Every element in the index is a pair of customer_id values. The recordlinkage API warns us against using a full index, which can get computationally very expensive. We don't need to worry about that here because the data is small. The size of the full index is a simple function of the sample size n: every record is paired with every other record exactly once, which gives n(n-1)/2 pairs.

n = restaurants.shape[0]
print('Number of pairs in index:', n * (n - 1) // 2)  # integer division keeps the count an int

That’s roughly 373k pairs, which we will process in just a few seconds.
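As a quick sanity check of the pair-count formula, the sketch below compares n(n-1)/2 against a brute-force enumeration for a few small sample sizes. The helper name `full_index_size` is ours, chosen for illustration, and is not part of recordlinkage.

```python
from itertools import combinations
from math import comb


def full_index_size(n):
    """Number of unordered record pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


for n in (2, 5, 100):
    # Enumerate every unordered pair explicitly and count them.
    brute_force = sum(1 for _ in combinations(range(n), 2))
    assert brute_force == full_index_size(n) == comb(n, 2)
    print(n, 'records ->', full_index_size(n), 'pairs')
```

Because the count grows quadratically, doubling the number of records roughly quadruples the number of pairs, which is why a full index only suits small datasets.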

Measuring similarity

Our data contains seven preprocessed attributes: clean and phonetic versions of the original customer names, cities, and streets (the _c and _p column suffixes in the code), plus a clean version of the phone numbers. We configure one similarity function per attribute.

comparer = rl.Compare(n_jobs=-1)
print('Configuring one similarity function per attribute...')
for attribute in ['customer_name_c', 'customer_name_p', 'city_c', 'city_p']:
    comparer.string(left_on=attribute, right_on=attribute, method='jarowinkler', label=attribute + '_score')
for attribute in ['street_c', 'street_p']:
    comparer.string(left_on=attribute, right_on=attribute, method='damerau_levenshtein', label=attribute + '_score')
comparer.exact(left_on='phone_c', right_on='phone_c', label='phone_c_score')
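To illustrate how an exact comparison differs from an edit-distance comparison, here is a minimal pure-Python sketch. It is not the recordlinkage implementation: `osa_distance` implements the restricted (optimal string alignment) variant of Damerau-Levenshtein, and scaling the distance by the longer string's length is our assumption about how to turn it into a score between 0 and 1. All function names and sample strings are hypothetical.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swapping two adjacent characters counts as one edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def edit_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def exact_score(a, b):
    """All-or-nothing comparison: 1.0 if identical, otherwise 0.0."""
    return 1.0 if a == b else 0.0


# A single swapped letter barely lowers the edit-based score...
print(round(edit_similarity('main street', 'mian street'), 3))
# ...while any difference at all drops the exact score to zero.
print(exact_score('5551234', '5551243'))
```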

The recordlinkage API has several built-in similarity functions. We have good reasons to choose different methods for different attributes.

Jaro-Winkler: It is a similarity function counting nearby matching ...