Oversampling with Text Augmentation

Learn how to improve the diversity of the training data by creating artificial matches using text augmentation.

We discuss the following two common issues with training data in this lesson:

  • Our training datasets usually contain many examples of no-matches and only a few matches. In machine learning jargon, this is a severe class imbalance between the majority class (no-matches) and the minority class (matches).

  • The few examples from the minority class do not cover all the class-invariant transformations (changes to a record that do not alter whether it is a match) that prior knowledge from similar tasks tells us to expect. As a result, our model will not generalize well to unseen examples.
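To make the idea of a class-invariant transformation concrete, here is a minimal sketch of how an artificial match could be generated from an existing record. The function names, the abbreviation table, and the sample record are all hypothetical and chosen only for illustration. They are not taken from the restaurants dataset or from any library.

```python
import random

# Hypothetical abbreviation table (an assumption for this sketch).
ABBREVIATIONS = {"street": "st", "avenue": "ave", "boulevard": "blvd"}


def abbreviate(text: str) -> str:
    """Replace known words with a common abbreviation."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two neighbouring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(record: dict, rng: random.Random) -> dict:
    """Return an artificial duplicate that still refers to the same entity."""
    return {
        "customer_name": swap_adjacent_chars(record["customer_name"], rng),
        "street": abbreviate(record["street"]),
        "phone": record["phone"],  # left untouched in this sketch
    }


rng = random.Random(42)  # seeded so the example is reproducible
# Made-up record, not from the lesson's data.
original = {"customer_name": "golden dragon", "street": "12 main street", "phone": "5551234"}
artificial = augment(original, rng)
print(original)
print(artificial)
```

The pair (original, artificial) is a match by construction, so each augmented record adds one more minority-class example without any manual review.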

Let’s see how text augmentation can help reveal such problems using the following dataset of restaurant records:

import pandas as pd
import recordlinkage.preprocessing as rlp
from itertools import combinations
from typing import Union


def cross_ref_to_index(df: pd.DataFrame, id_column: str, match_key_columns: Union[str, list[str]]) -> pd.MultiIndex:
    # Collect the IDs that share a match key, largest ID first
    match_lists = df.sort_values(id_column, ascending=False).groupby(match_key_columns)[id_column].apply(list)
    # Keep only the keys shared by at least two records
    match_lists = match_lists.loc[match_lists.apply(len) > 1]
    # Every pair of IDs within a group is a true match
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)


# Load ground truth:
xref = pd.read_csv('solvers_kitchen/classes.csv')
true_matches = cross_ref_to_index(df=xref, id_column='customer_id', match_key_columns='class')
print('Number of actual matches in the data: ', true_matches.shape[0])

# Load the records and normalize the text columns and phone numbers:
restaurants = pd.read_csv('solvers_kitchen/restaurants.csv').set_index('customer_id').drop(['city', 'restaurant_type'], axis=1)
for col in ['customer_name', 'street']:
    restaurants[col] = rlp.clean(restaurants[col])
restaurants['phone'] = rlp.phonenumbers(restaurants['phone'])

# n records form n * (n - 1) / 2 unordered pairs
print('Total number of pairs: ', restaurants.shape[0] * (restaurants.shape[0] - 1) // 2)
print(restaurants.head())

The first lines of the output show examples of matching pairs and how they vary in names and streets.
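The way the ground truth turns class labels into matching pairs can be illustrated on a toy input. The sketch below mirrors the logic of cross_ref_to_index in plain Python, without pandas. The function name pairs_from_classes and the toy data are assumptions made for this illustration only.

```python
from collections import defaultdict
from itertools import combinations


def pairs_from_classes(records):
    """Turn (record_id, class_label) tuples into a list of matching ID pairs."""
    groups = defaultdict(list)
    # Visit records with the largest ID first, as the lesson's helper does
    for record_id, label in sorted(records, reverse=True):
        groups[label].append(record_id)
    pairs = []
    for ids in groups.values():
        if len(ids) > 1:  # a class with a single record yields no pair
            pairs.extend(combinations(ids, 2))
    return pairs


# Toy cross-reference: records 1, 2, and 5 describe the same entity "a"
toy = [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "a")]
toy_pairs = pairs_from_classes(toy)
print(toy_pairs)
```

A class with k records contributes k * (k - 1) / 2 pairs, so the three records of class "a" yield three matches, while classes "b" and "c" yield none.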

Testing performance on seen data

The restaurants dataset comes with the ground truth so that we can experiment and evaluate our work. It contains 112 actual matches among the 372,816 pairs of records. Let’s assume we have already reviewed every pair, so the entire dataset is available for training. We aim to train a model that generalizes well to unseen records, that is, to new records not covered by this data.
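The two figures above are enough to quantify the class imbalance. The short calculation below is a sketch using only those numbers. The record count of 864 is not stated in the lesson; it is inferred here from the pair-count formula and checked in the code.

```python
def n_pairs(n_records: int) -> int:
    """Number of unordered pairs that n_records records can form."""
    return n_records * (n_records - 1) // 2


total_pairs = 372816  # from the lesson
n_matches = 112       # from the lesson

# Inferred, not stated in the lesson: 864 records give exactly this pair count
assert n_pairs(864) == total_pairs

match_share = n_matches / total_pairs
imbalance_ratio = (total_pairs - n_matches) / n_matches

print(f"Share of matches among all pairs: {match_share:.4%}")
print(f"No-matches per match: {imbalance_ratio:.0f}")
```

Matches make up about 0.03% of all pairs, which is why a classifier can reach very high accuracy by predicting no-match everywhere and why accuracy alone says little here.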

Below, we load precomputed similarity scores across the three dimensions and fit a binary classification ...