Oversampling with Text Augmentation

Learn how to improve the diversity of the training data by creating artificial matches using text augmentation.

We'll cover the following...

Testing performance on seen data
Encoding prior knowledge with text augmentation
Challenging our model with augmented data
Improving our model with augmented data
When augmentation is not enough
Key takeaway

We discuss the following two common issues with training data in this lesson:

Training datasets usually contain many examples of no-matches and only a few matches. In machine learning jargon, this is a severe class imbalance between the majority class (no-matches) and the minority class (matches).
The few examples from the minority class do not cover all the class-invariant transformations that we know from similar tasks (prior knowledge). As a result, our model will not generalize well to unseen examples.
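
To make the second issue concrete, here is a minimal sketch of two class-invariant transformations: abbreviating a street word and swapping two neighboring characters. Applying them to a record yields an artificial duplicate that should still count as a match. The helper names, the abbreviation table, and the sample record are all hypothetical and only illustrate the idea; they are not taken from the course's code.

```python
import random

# Hypothetical abbreviation table; a real project would curate its own.
ABBREVIATIONS = {"street": "st", "avenue": "ave", "boulevard": "blvd"}


def abbreviate(text: str) -> str:
    """Replace known words with a common abbreviation."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two neighboring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(record: dict, rng: random.Random) -> dict:
    """Create an artificial duplicate that should still match the original."""
    return {
        "customer_name": swap_adjacent_chars(record["customer_name"], rng),
        "street": abbreviate(record["street"]),
        "phone": record["phone"],  # left untouched in this sketch
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
# Made-up record, not part of the restaurants dataset
original = {"customer_name": "golden dragon", "street": "12 main street", "phone": "5551234"}
artificial = augment(original, rng)
print(original)
print(artificial)
```

The pair (original, artificial) is a match by construction, so generating such pairs adds minority-class examples without any manual review.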

Let’s see how text augmentation can help reveal such problems using the following dataset of restaurant records:

import pandas as pd
import recordlinkage.preprocessing as rlp
from itertools import combinations
from typing import Union


def cross_ref_to_index(df: pd.DataFrame, id_column: str, match_key_columns: Union[str, list[str]]) -> pd.MultiIndex:
    """Turn a cross-reference table into a MultiIndex of matching ID pairs."""
    # Collect the IDs that share the same match key
    match_lists = df.sort_values(id_column, ascending=False).groupby(match_key_columns)[id_column].apply(list)
    # Keep only the keys that refer to more than one record
    match_lists = match_lists.loc[match_lists.apply(len) > 1]

    # Every pair of IDs within a group is a match
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))

    return pd.MultiIndex.from_tuples(match_pairs)


# Load ground truth:
xref = pd.read_csv('solvers_kitchen/classes.csv')
true_matches = cross_ref_to_index(df=xref, id_column='customer_id', match_key_columns='class')
print('Number of actual matches in the data: ', true_matches.shape[0])

# Load the restaurant records and drop the columns we do not compare
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv').set_index('customer_id').drop(['city', 'restaurant_type'], axis=1)
# Normalize the text columns and the phone numbers
for col in ['customer_name', 'street']:
    restaurants[col] = rlp.clean(restaurants[col])
restaurants['phone'] = rlp.phonenumbers(restaurants['phone'])

# Every unordered pair of records is a candidate: n * (n - 1) / 2
print('Total number of pairs: ', restaurants.shape[0] * (restaurants.shape[0] - 1) // 2)
print(restaurants.head())

The first lines of the output show examples of matching pairs and how they vary in names and streets.
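
To see what cross_ref_to_index does with the ground truth, the following dependency-free sketch reproduces the same grouping logic on a tiny, made-up cross-reference table. The rows are hypothetical and the plain-Python loop stands in for the pandas groupby, so treat it as an illustration rather than a replacement.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical cross-reference rows: (customer_id, class)
rows = [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "c"), (6, "c")]

# Collect the IDs that share the same class
ids_by_class = defaultdict(list)
for customer_id, cls in rows:
    ids_by_class[cls].append(customer_id)

match_pairs = []
for ids in ids_by_class.values():
    # Sort descending to mirror sort_values(..., ascending=False);
    # classes with a single ID produce no pairs
    match_pairs += list(combinations(sorted(ids, reverse=True), 2))

print(match_pairs)
# [(2, 1), (6, 5), (6, 4), (5, 4)]
```

A class with three records contributes three pairs, which is why the number of matches can exceed the number of duplicated entities.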

Testing performance on seen data

The restaurants dataset comes with the ground truth so that we can experiment and evaluate our work. It covers 112 actual matches among the 372816 pairs of records. Let’s assume we have already reviewed every pair, so the entire dataset is available for training. We aim to train a model that generalizes well to unseen records, in other words, to new records not covered by this data.
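
The two figures above are enough to quantify the class imbalance. The short sketch below recovers the number of records from the pair count and computes the share of matches; the variable names are our own, and the record count is inferred from the pair count rather than stated in the lesson.

```python
import math

total_pairs = 372816  # from the lesson
n_matches = 112       # from the lesson

# Solve n * (n - 1) / 2 = total_pairs for n
n_records = (1 + math.isqrt(1 + 8 * total_pairs)) // 2
assert math.comb(n_records, 2) == total_pairs

print(n_records)                          # 864
print(f"{n_matches / total_pairs:.4%}")   # 0.0300%
print(round(total_pairs / n_matches))     # 3329
```

Roughly one pair in 3,329 is a match, which is the kind of severe imbalance that motivates oversampling.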

Below, we load precomputed similarity scores across the three dimensions and fit a binary classification model. We use the CatBoost ...
