Oversampling with Text Augmentation
Learn how to improve the diversity of the training data by creating artificial matches using text augmentation.
We discuss the following two common issues with training data in this lesson:
Usually, our training datasets contain many examples of no-matches and only a few examples of matches. In machine learning jargon, this is a severe class imbalance between the majority class (no-matches) and the minority class (matches).
The few examples from the minority class do not cover all the class-invariant transformations that we know from similar tasks (prior knowledge). As a result, our model will not generalize well to unseen examples.
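To make the idea of a class-invariant transformation concrete, here is a minimal sketch using only the Python standard library. It assumes two simple transformations, a typo created by swapping neighboring characters and a word abbreviation, neither of which is necessarily the one used later in this lesson. The abbreviation table, the function names, and the sample record are all invented for illustration.

```python
import random

# Hypothetical abbreviation table (an assumption, not taken from the dataset).
ABBREVIATIONS = {"street": "st", "avenue": "ave", "boulevard": "blvd"}


def abbreviate(text: str) -> str:
    """Replace whole words with their abbreviation if one is known."""
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())


def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two neighboring characters."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def augment(record: dict, rng: random.Random) -> dict:
    """Return an artificial duplicate that still describes the same entity."""
    return {
        "customer_name": swap_adjacent_chars(record["customer_name"], rng),
        "street": abbreviate(record["street"]),
        "phone": record["phone"],
    }


rng = random.Random(42)  # fixed seed for reproducibility

# Invented sample record, not a row from the restaurants data.
original = {
    "customer_name": "luigis pizza palace",
    "street": "12 main street",
    "phone": "5550100",
}
artificial = augment(original, rng)

# The pair (original, artificial) is a match by construction.
print(original)
print(artificial)
```

Each augmented copy forms a new match with its original by construction, which is how augmentation can add minority-class examples without any manual labeling.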
Let’s see how text augmentation can help reveal such problems using the following dataset of restaurant records:
import pandas as pd
import recordlinkage.preprocessing as rlp
from itertools import combinations
from typing import Union


def cross_ref_to_index(df: pd.DataFrame, id_column: str, match_key_columns: Union[str, list[str]]) -> pd.MultiIndex:
    match_lists = df.sort_values(id_column, ascending=False).groupby(match_key_columns)[id_column].apply(lambda s: list(s))
    match_lists = match_lists.loc[match_lists.apply(lambda s: len(s)) > 1]
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)


# Load ground truth:
xref = pd.read_csv('solvers_kitchen/classes.csv')
true_matches = cross_ref_to_index(df=xref, id_column='customer_id', match_key_columns='class')
print('Number of actual matches in the data: ', true_matches.shape[0])

restaurants = pd.read_csv('solvers_kitchen/restaurants.csv').set_index('customer_id').drop(['city', 'restaurant_type'], axis=1)
for col in ['customer_name', 'street']:
    restaurants[col] = rlp.clean(restaurants[col])
restaurants['phone'] = rlp.phonenumbers(restaurants['phone'])

print('Total number of pairs: ', restaurants.shape[0] * (restaurants.shape[0] - 1) / 2)
print(restaurants.head())
The first lines of the output show examples of matching pairs and how they vary in names and streets.
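The helper cross_ref_to_index turns class labels into pairs of matching record IDs. The following plain-Python sketch mirrors that grouping logic on a tiny invented table, so the result can be checked by hand. The function name class_labels_to_pairs and the toy labels are made up for this illustration, and the sketch returns a list of tuples rather than a pandas MultiIndex.

```python
from collections import defaultdict
from itertools import combinations


def class_labels_to_pairs(labels: dict[int, str]) -> list[tuple[int, int]]:
    """Build all pairs of record IDs that share the same class label."""
    groups = defaultdict(list)
    # Visit IDs in descending order, as the pandas helper sorts descending.
    for record_id in sorted(labels, reverse=True):
        groups[labels[record_id]].append(record_id)
    pairs = []
    for ids in groups.values():
        # A class with a single record yields no pairs.
        pairs.extend(combinations(ids, 2))
    return pairs


# Toy ground truth: records 1, 2, and 5 describe the same entity.
toy_labels = {1: "a", 2: "a", 3: "b", 4: "c", 5: "a"}
toy_pairs = class_labels_to_pairs(toy_labels)
print(toy_pairs)
```

A class with k records contributes k * (k - 1) / 2 matching pairs, so the three records of class "a" produce three pairs, while classes "b" and "c" produce none.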
Testing performance on seen data
The restaurants dataset comes with the ground truth so that we can experiment and evaluate our work. It covers 112 actual matches among the 372816 pairs of records. Let’s assume we have already reviewed every pair, so the entire dataset is available for training. We aim to train a model that generalizes well to unseen records, in other words, to new records not covered by this data.
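The two numbers above are enough to quantify the class imbalance. The short calculation below is a sketch that uses only the figures quoted in the text. The number of records is not stated in the lesson; it is inferred here by solving n * (n - 1) / 2 = 372816 for n.

```python
import math

n_pairs = 372816   # total number of record pairs, from the text
n_matches = 112    # actual matches, from the text

# Inferred, not stated in the lesson: solve n * (n - 1) / 2 = n_pairs for n.
n_records = (1 + math.isqrt(1 + 8 * n_pairs)) // 2
assert n_records * (n_records - 1) // 2 == n_pairs

match_share = n_matches / n_pairs
no_matches_per_match = (n_pairs - n_matches) / n_matches

print(f"records: {n_records}")
print(f"share of matches: {match_share:.4%}")
print(f"no-matches per match: {no_matches_per_match:.0f}")
```

Only about 0.03% of all pairs are matches, which means there are more than 3,000 no-matches for every match.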
Below, we load precomputed similarity scores across the three dimensions and fit a binary classification ...