Similarity Features
Become familiar with the RecordLinkage API for engineering similarity features.
RecordLinkage works in two main steps:
- Indexing: Select which pairs of records are duplicate candidates and therefore should be compared.
- Scoring: Configure and compute a vector of similarity functions for every pair in the index.
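To make the two steps concrete before turning to the library, here is a minimal standard-library sketch. It does not use RecordLinkage: the toy records are hypothetical, and `difflib.SequenceMatcher` is a stand-in similarity chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical toy records keyed by an ID; not the lesson's restaurant data.
records = {
    1: {"name": "golden dragon", "phone": "5550101"},
    2: {"name": "goldn dragon", "phone": "5550101"},
    3: {"name": "blue lagoon", "phone": "5550199"},
}

# Step 1 (indexing): every unordered pair of IDs becomes a candidate.
candidate_pairs = list(combinations(sorted(records), 2))


def score_pair(left, right):
    """Step 2 (scoring): return one similarity value per attribute."""
    name_score = SequenceMatcher(None, left["name"], right["name"]).ratio()
    phone_score = 1.0 if left["phone"] == right["phone"] else 0.0
    return (name_score, phone_score)


# One feature vector per candidate pair.
features = {
    (a, b): score_pair(records[a], records[b]) for a, b in candidate_pairs
}

for pair, vector in features.items():
    print(pair, [round(value, 2) for value in vector])
```

Each candidate pair ends up with one score per attribute, which is the shape of the similarity features this lesson builds.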
All-in indexing
We keep it simple here and add every possible pair to the index—a “full” index in the RecordLinkage terminology.
```python
import pandas as pd
import recordlinkage as rl

restaurants = pd.read_csv('solvers_kitchen/restaurants.csv')

indexer = rl.Index()
indexer.full()  # add all possible pairs to the index
candidate_links = indexer.index(restaurants.set_index('customer_id'))
print(candidate_links[:3])
```
Every element in the index is a pair of customer_id values. The recordlinkage API warns us against using a full index, which can get computationally very expensive. That is not a concern here because the dataset is small. The size of the full index is a simple function of the sample size: n records yield n(n - 1)/2 pairs.
```python
n = restaurants.shape[0]
# Integer division keeps the pair count a whole number.
print('Number of pairs in index:', n * (n - 1) // 2)
```
That’s roughly 373k pairs, which we will process in just a few seconds.
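The quadratic growth is the reason a full index becomes expensive on larger data. The sketch below checks the pair-count formula against explicit enumeration and then evaluates it for a few hypothetical dataset sizes, which are chosen purely for illustration and are not taken from the restaurant data.

```python
from itertools import combinations


def full_index_size(n):
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


# Sanity check against explicit enumeration for a small n.
small_n = 6
assert full_index_size(small_n) == len(list(combinations(range(small_n), 2)))

# Hypothetical dataset sizes, chosen only to show the quadratic growth.
for n in (100, 1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {full_index_size(n):>13,} pairs")
```

A tenfold increase in records leads to roughly a hundredfold increase in pairs.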
Measuring similarity
Our data contains seven preprocessed attributes: clean and phonetic versions of the original customer names, cities, and streets, plus a clean version of the phone numbers. We configure one similarity function per attribute.
```python
comparer = rl.Compare(n_jobs=-1)  # n_jobs=-1: use all available cores
print('Configuring one similarity function per attribute...')

# Names and cities: Jaro-Winkler similarity
for attribute in ['customer_name_c', 'customer_name_p', 'city_c', 'city_p']:
    comparer.string(left_on=attribute, right_on=attribute,
                    method='jarowinkler', label=attribute + '_score')

# Streets: Damerau-Levenshtein similarity
for attribute in ['street_c', 'street_p']:
    comparer.string(left_on=attribute, right_on=attribute,
                    method='damerau_levenshtein', label=attribute + '_score')

# Phone numbers: exact match
comparer.exact(left_on='phone_c', right_on='phone_c', label='phone_c_score')
```
The recordlinkage API has several built-in similarity functions. We have good reasons to choose different methods for different attributes.
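To give a feel for what an edit-distance-based string score measures, here is a from-scratch sketch of the optimal string alignment variant of the Damerau-Levenshtein distance, turned into a similarity between 0 and 1. The normalization by the longer string's length is an assumption made for this illustration; the library's own `damerau_levenshtein` implementation and scaling may differ. The function names and example strings are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance (restricted Damerau-Levenshtein).

    Counts insertions, deletions, substitutions, and swaps of two
    adjacent characters needed to turn a into b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similarity(a, b):
    """Scale the distance to [0, 1]; 1.0 means identical strings.

    Assumption: normalize by the length of the longer string.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - osa_distance(a, b) / longest


# Hypothetical street strings: a single swapped letter pair costs one edit.
print(osa_distance("main street", "mian street"))
print(round(similarity("main street", "mian street"), 3))
```

Counting a swap of neighboring letters as one edit rather than two keeps the score high for a common kind of typing mistake.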
The