
How to solve cold start problems with synthetic data generation

Paul Kinsvater
7 min read

Training deduplication models on augmented data#

Unresolved data can be a severe problem for a business—think of duplicative customer or product records within a system and nonexistent join keys across systems. Unresolved data can affect practically every business function:

  • Purchasing: There is a risk of duplicate spending on software licenses because of a lack of transparency about what has already been bought.

  • Manufacturing: Supply chains can ship the wrong products to the wrong markets because manufacturing lacks visibility into what’s in the sales pipeline.

  • Sales: Salespeople can approach the wrong customers with the wrong products because they don’t know what will be in stock.

  • Marketing: Campaigns can target the same customer multiple times using data from different channels.

  • Legal: GDPR and similar laws can require companies to respond promptly to a customer’s request to delete personal data.

Example: Deduplicating restaurant records#

The restaurants dataset has been provided by the DuDe team from the Hasso Plattner Institute, University of Potsdam (“DuDe,” n.d., hpi.de, accessed May 3, 2024, https://hpi.de/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html). Many thanks for this great contribution.

Let’s take the following dataset of restaurant records as an example:

import pandas as pd
restaurants = pd.read_parquet('dude_restaurants.parquet')
num_records = restaurants.shape[0]
num_clusters = restaurants['cluster_id'].nunique()
num_redundant = num_records - num_clusters
print(f'Total number of records: {num_records}\nNumber of clusters: {num_clusters}')
print(f'\nTotal number of pairs: {num_records * (num_records-1)//2}\nNumber of matching pairs: {num_redundant}')
print('\nFirst five:')
restaurants.head()

Output:

Total number of records: 864
Number of clusters: 752
Total number of pairs: 372816
Number of matching pairs: 112
First five:
| | customer_id | cluster_id | name | addr | city | phone | type |
|---:|---:|---:|:---|:---|:---|:---|:---|
| 0 | 1 | 0 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american |
| 1 | 2 | 0 | arnie morton's of chicago | 435 s. la cienega blvd. | los angeles | 310-246-1501 | steakhouses |
| 2 | 3 | 1 | art's delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | american |
| 3 | 4 | 1 | art's deli | 12224 ventura blvd. | studio city | 818-762-1221 | delis |
| 4 | 5 | 2 | hotel bel-air | 701 stone canyon rd. | bel air | 310/472-1211 | californian |

The restaurants dataset contains the ground truth encoded in the cluster_id. All records within a cluster are pairwise duplicates. The data covers 112 matches among the 372816 possible pairs. Let’s assume we have already manually labeled a fraction of the pairs to use for training and treat the rest as a test set.
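As a rough sketch of where such pairwise labels could come from (the helper below is illustrative and not part of the article’s pipeline), we can enumerate all record pairs and mark a pair as a match whenever both records share a cluster_id:

from itertools import combinations

def build_pair_labels(df: pd.DataFrame) -> pd.DataFrame:
    # Illustrative helper (not from the original pipeline): enumerate all pairs
    # and label a pair as a match when both records share the same cluster_id.
    records = list(df[['customer_id', 'cluster_id']].itertuples(index=False))
    pairs = [{'id_a': a.customer_id, 'id_b': b.customer_id, 'class': int(a.cluster_id == b.cluster_id)}
             for a, b in combinations(records, 2)]
    return pd.DataFrame(pairs)

pair_labels = build_pair_labels(restaurants)
print(pair_labels['class'].value_counts())  # distribution of match vs. no-match pairs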

Below, we load pre-computed similarity scores and fit a binary classification model. We use CatBoost, which allows us to add monotonicity constraints: the predicted match probability can only increase as any of the similarity scores increases.

  • We read the pre-computed similarity features with pd.read_parquet. We’ve omitted the feature engineering details here for brevity.
  • We split the data into training and test sets with train_test_split to simulate a situation where we manually labeled half of the data.
  • We configure a CatBoostClassifier with monotonicity constraints and fit it to the training dataset.
  • We print the estimated feature importance scores to get an idea of the relevance of each feature.
  • Finally, we evaluate the model’s performance on the test set using precision, recall, and F1 as measures.
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
# We omit the details of how to compute similarity features and just load pre-computed scores:
df = pd.read_parquet('similarity_features.parquet')
print(df.head())
# Stratified sampling on the class column gives us the same imbalances in both subsets:
df_train, df_test = train_test_split(df, test_size=0.5, stratify=df['class'], random_state=1)
# Evaluate performance:
print('\n--- Fit Catboost with monotonicity constraints ---')
model_features = ['name_score', 'street_score', 'phone_score']
model = CatBoostClassifier(random_state=1, monotone_constraints=[1, 1, 1])
model.fit(X=df_train[model_features], y=df_train['class'], verbose=False)
print('\nRelative feature importance:')
print(model.get_feature_importance(prettified=True))
y_test_pred = model.predict(df_test[model_features])
print('---\nPrecision on test set: ', precision_score(df_test['class'], y_test_pred))
print('Recall on test set: ', recall_score(df_test['class'], y_test_pred))
print('F1 on test set: ', f1_score(df_test['class'], y_test_pred))

Output:

| | name_score | street_score | phone_score | class |
|:-------|-------------:|---------------:|--------------:|--------:|
| (2, 1) | 1 | 0.990476 | 1 | 1 |
| (3, 1) | 0.568301 | 0.575926 | 0.3 | 0 |
| (3, 2) | 0.568301 | 0.610582 | 0.3 | 0 |
| (4, 1) | 0.594577 | 0.575926 | 0.3 | 0 |
| (4, 2) | 0.594577 | 0.610582 | 0.3 | 0 |
--- Fit Catboost with monotonicity constraints ---
Relative feature importance:
Feature Id Importances
0 name_score 36.647693
1 street_score 32.938509
2 phone_score 30.413798
---
Precision on test set: 0.9811320754716981
Recall on test set: 0.9285714285714286
F1 on test set: 0.9541284403669725

The test performance makes a strong impression. A precision of 98% means that 98 of every 100 predicted matches are actual matches, and a recall of 93% means that we missed only 7 of every 100 actual matches. The F1 score is the harmonic mean of both.
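As a quick sanity check, we can recompute the F1 score directly from the reported precision and recall:

# F1 is the harmonic mean of precision and recall:
precision, recall = 0.9811320754716981, 0.9285714285714286
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # ~0.9541, matching the reported F1 score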

There are, however, two concerns with what we did:

  1. Assuming that half of all pairs are available for training is unrealistic. Most projects start without a single label, and labeling (or any other form of supervision) is a big part of the challenge.
  2. Even if we start with labeled data, how can we know that the available data represents unseen examples well? Our customer base might grow, and so might the number of dirty records.

We illustrate how to tackle both points with synthetic data.

Fighting cold starts with synthetic duplicates#

Let’s assume that the cluster_id column is left blank entirely. We call this a cold start. Let’s further assume that we ran out of budget. Proper manual labeling can be a costly exercise.

The trick is that manual labels are only one of our supervision sources. Call it domain knowledge, subject matter expertise, or common sense: a typo, word swap, or many other slight variations of a restaurant record still describe the same restaurant entity. In classification, we call this a class-invariant transformation.

In the following code playground, we use the nlpaug package to augment a restaurant name with a few class-invariant transformations:

  • augmenter_delete_char deletes up to two random characters in a text.
  • augmenter_swap_char swaps adjacent characters in up to two words of a text.
  • augmenter_swap_word swaps, at most, one pair of adjacent words in the text.
  • name_augmenter orchestrates the individual augmenters in a flow that randomly applies a subset of the steps.
import random
import numpy as np
import nlpaug.augmenter.char as nac # Character-level transformations
import nlpaug.augmenter.word as naw # Word-level transformations
import nlpaug.flow as naf # Orchestrate different transformations
random.seed(1)
np.random.seed(1)
customer_name = 'arnie mortons of chicago'
num_aug = 3 # Number of trials
print('Delete at most two random characters:')
augmenter_delete_char = nac.RandomCharAug(action='delete', aug_char_max=2)
print(augmenter_delete_char.augment(customer_name, n=num_aug))
print('\nSwap one pair of adjacent characters per word in max two words:')
augmenter_swap_char = nac.RandomCharAug(action='swap', aug_char_max=1, aug_word_max=2)
print(augmenter_swap_char.augment(customer_name, n=num_aug))
print('\nSwap one pair of adjacent words:')
augmenter_swap_word = naw.RandomWordAug(action='swap', aug_max=1)
print(augmenter_swap_word.augment(customer_name, n=num_aug))
print('\nOrchestrate two or more transformation in a "flow":')
name_augmenter = naf.Sometimes([
augmenter_delete_char,
augmenter_swap_char,
augmenter_swap_word
], aug_p=0.3) # The chance of execution for every step
print(name_augmenter.augment(customer_name, n=num_aug))

Output:

Delete at most two random characters:
['rie mortons of chiag', 'arnie mrton of hiago', 'rne moros of chicago']
Swap one pair of adjacent characters per word in max two words:
['arnei mrotons of chicago', 'arine mortons of chicgao', 'arine omrtons of chicago']
Swap one pair of adjacent words:
['arnie of mortons chicago', 'arnie mortons chicago of', 'mortons arnie of chicago']
Orchestrate two or more transformation in a "flow":
['arnie of mortons chicago', 'arnei morotns chicago of', 'arnie ortns of ciago']

The Sometimes augmenter orchestrates individual steps and executes a random subset. This way, we can cover diverse transformations with just a single configuration. For streets and phones, we apply different pipelines in the following code.

  • naw.ReservedAug replaces up to three tokens in a text using preconfigured aliases.
  • For phone numbers, we substitute up to ten characters with one of the provided digit candidates.
random.seed(1)
np.random.seed(1)
street = '435 s la cienega blv'
num_aug = 3 # Number of trials
# We have plenty of options, including hard-coded alias replacements:
reserved_tokens = [
['n', 'north'],
['e', 'east'],
['s', 'south'],
['w', 'west'],
['blv', 'blvd', 'boulevard'],
['st', 'street'],
['sts', 'streets'],
['ave', 'avenue'],
['rd', 'road']
]
reserved_aug = naw.ReservedAug(reserved_tokens=reserved_tokens, aug_max=3)
street_augmenter = naf.Sometimes([
reserved_aug,
nac.RandomCharAug(action='swap', aug_char_max=1, aug_word_max=2),
], aug_p=0.5)
print('Augment streets:')
print('Original: ', street)
print(street_augmenter.augment(street, n=num_aug))
phone = '3102461501'
phone_augmenter = naf.Sometimes([
nac.RandomCharAug(action='swap', aug_char_max=3, aug_word_max=1),
nac.RandomCharAug(action='substitute', aug_char_max=10, candidates=list('0123456789'))
], aug_p=0.5)
print('\nAugment phones:')
print('Original: ', phone)
print(phone_augmenter.augment(phone, n=num_aug))

Output:

Augment streets:
Original: 435 s la cienega blv
['435 south la cienega boulevard', '435 sotuh la cienega bouleavrd', '435 osuth la cienega lbvd']
Augment phones:
Original: 3102461501
['1320665801', '1320641501', '3102461150']

Let’s put these individual augmentation functions to work by orchestrating them and applying them systematically to all our restaurant records.

  • The augment_record function produces duplicates for a single original record.
  • The create_synthetic_dataset function systematically applies augment_record on every original record.
  • In the last nine lines, we configure augmentation and apply it to all original records to produce synthetic clusters of size 5.
from functools import reduce
from nlpaug.base_augmenter import Augmenter

def augment_record(record: pd.Series, config: dict[str, Augmenter], n_per_attribute: int,
                   random_seed: int = None) -> pd.DataFrame:
    """Augment a single record across multiple attributes, creating a table of several duplicates.

    Starts from the original `record` and applies augmenters on every attribute configured in the `config`.
    We return the cartesian product of the original and augmented attributes across all configured attributes.
    Every attribute in `record` that is not in the `config` is added back as an exact copy.
    """
    if random_seed:
        # nlpaug requires seeding both to make it deterministic
        random.seed(random_seed)
        np.random.seed(random_seed)
    augmented_attributes = dict()
    for attribute, augmenter in config.items():
        # Apply the configured augmenters:
        augmented_attributes[attribute] = augmenter.augment(record[attribute], n=n_per_attribute)
    # Create one single-column DataFrame per augmented attribute:
    dfs = [pd.DataFrame({attribute: [record[attribute]] + records})
           for attribute, records in augmented_attributes.items()]
    # Create cartesian product across all augmented attribute series:
    res = reduce(lambda df1, df2: pd.merge(df1, df2, how='cross'), dfs)
    # Add the same attribute values for any attributes not in the config:
    for col in record.index:
        if col not in res.columns:
            res[col] = record[col]
    # Return the DataFrame with columns in the same order as in the original record
    return res[record.index]

def create_synthetic_dataset(df: pd.DataFrame, config: dict[str, Augmenter], cluster_size,
                             random_seed: int = None) -> pd.DataFrame:
    """Augment every record in `df` with the same augmentation `config`.

    Returns a table of record clusters with approximately `cluster_size` records per cluster.
    """
    augmented_dfs = []
    # We need roughly cluster_size**(1/len(config)) augmentations per attribute; we add a generous buffer:
    n_per_attribute = round(cluster_size ** (1 / len(config))) + 10
    for original_idx, row in df.iterrows():
        try:
            duplicates = augment_record(record=row, config=config, n_per_attribute=n_per_attribute,
                                        random_seed=random_seed)
            duplicates['cluster_id'] = original_idx
            # Keep the original and a random sample of the rest to meet the cluster size:
            duplicates = pd.concat([duplicates.head(1),
                                    duplicates.tail(-1).drop_duplicates().sample(cluster_size - 1)])
        except TypeError:
            # nlpaug does not behave as expected in some cases, so not enough duplicates are created.
            # Let's just return the original record in those cases:
            duplicates = pd.DataFrame(row).T
            duplicates['cluster_id'] = original_idx
        augmented_dfs.append(duplicates)
    return pd.concat(augmented_dfs, ignore_index=True)
augment_config = {
    'name': name_augmenter,
    'addr': street_augmenter,
    'phone': phone_augmenter
}
synthetic = create_synthetic_dataset(df=restaurants.drop(['cluster_id', 'customer_id'], axis=1), config=augment_config, cluster_size=5, random_seed=1)
print(f'Total number of records: {synthetic.shape[0]}\nAverage cluster size: {synthetic.cluster_id.value_counts().mean()}')
print('\nFirst cluster:')
synthetic.head()

Output:

Total number of records: 4320
Average cluster size: 5.0
First cluster:
| | name | addr | city | phone | type | cluster_id |
|---:|:----------------------------|:---------------------------------|:------------|:-----------------|:---------|-------------:|
| 0 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american | 0 |
| 1 | rni mortno ' s of chioa | 435 s. la icenega blv. | los angeles | 310 / 246 - 5401 | american | 0 |
| 2 | arn ortn ' s of chicago | 435 s. la icenega blv. | los angeles | 310 / 246 - 5113 | american | 0 |
| 3 | arnie morton ' s chicago of | 435 south. la icenega boulveard. | los angeles | 310 / 246 - 8015 | american | 0 |
| 4 | aie omrton ' s of cicoa | 435 sotuh. la ciengea boulevard. | los angeles | 310 / 246 - 1511 | american | 0 |

The synthetic examples of the first cluster look reasonably close, so we can consider them duplicates of the same restaurant.
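How close is close? As a quick, purely illustrative check (the article’s real name_score, street_score, and phone_score features are pre-computed elsewhere and may be built differently), a simple character-based similarity from Python’s standard library can quantify the overlap between the first two records of the cluster:

from difflib import SequenceMatcher

def simple_similarity(a: str, b: str) -> float:
    # Illustrative character-based similarity in [0, 1]; the real feature pipeline may differ.
    return SequenceMatcher(None, a, b).ratio()

# Compare the original record with its first synthetic duplicate:
row_a, row_b = synthetic.iloc[0], synthetic.iloc[1]
print({
    'name_score': simple_similarity(row_a['name'], row_b['name']),
    'street_score': simple_similarity(row_a['addr'], row_b['addr']),
    'phone_score': simple_similarity(row_a['phone'], row_b['phone']),
})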

Performance evaluation#

We pre-computed similarity features for all pairs in the synthetic data. Any two synthetic records with the same cluster_id are labeled with class=1 (match) and all others with class=0 (no match). Below, we train a model on just the synthetic examples and evaluate the performance on the original test set.

# Here, we also load pre-computed similarity features for our synthetic data:
df_synthetic = pd.read_parquet('synthetic_similarity_features.parquet')
# Evaluate performance:
print('\n--- Fit Catboost to synthetic data ---')
model_features = ['name_score', 'street_score', 'phone_score']
model_synthetic = CatBoostClassifier(random_state=1, monotone_constraints=[1, 1, 1])
model_synthetic.fit(X=df_synthetic[model_features], y=df_synthetic['class'], verbose=False)
print('\nRelative feature importance:')
print(model_synthetic.get_feature_importance(prettified=True))
y_test_pred = model_synthetic.predict(df_test[model_features])
print('---\nPrecision on test set: ', precision_score(df_test['class'], y_test_pred))
print('Recall on test set: ', recall_score(df_test['class'], y_test_pred))
print('F1 on test set: ', f1_score(df_test['class'], y_test_pred))

Output:

--- Fit Catboost to synthetic data ---
Relative feature importance:
Feature Id Importances
0 phone_score 51.308700
1 name_score 28.436685
2 street_score 20.254615
---
Precision on test set: 0.7903225806451613
Recall on test set: 0.875
F1 on test set: 0.8305084745762712

Our performance is worse, but overall, it’s not bad, considering that we did not use a single hand-labeled example. Note also that the model’s feature importance is drastically different from that of the model fitted to the original training data. Phones are a lot more important now.

Let’s also flip sides by using the model fitted to the original training data and evaluating its performance on synthetic data:

df_synthetic = pd.read_parquet('synthetic_similarity_features.parquet')
# Use model trained on original data and test on synthetic data:
y_synthetic_pred = model.predict(df_synthetic[model_features])
print('---\nPrecision on test set: ', precision_score(df_synthetic['class'], y_synthetic_pred))
print('Recall on test set: ', recall_score(df_synthetic['class'], y_synthetic_pred))
print('F1 on test set: ', f1_score(df_synthetic['class'], y_synthetic_pred))

Output:

---
Precision on test set: 0.7596377749029755
Recall on test set: 0.6796296296296296
F1 on test set: 0.7174098961514966

That’s a strong decrease compared to the performance on the original test data. We can explain this in two ways. First, note that our synthetic data generation has one significant flaw: if two original records are duplicates, every combination of their synthetic variations must also be a match.

We did not consider this since we did not know which original records were duplicates. As a consequence, some of our no-match labels are wrong. Let’s repeat the last evaluation, but this time on cleansed synthetic labels (something we cannot do easily in practice).
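To make the cleansing idea concrete: because create_synthetic_dataset sets each synthetic cluster_id to the original row index, we could map every synthetic record back to its original ground-truth cluster and relabel pairs whose source records are true duplicates. The sketch below is purely illustrative; in a real cold start, this ground truth is exactly what we lack.

# Each synthetic cluster_id equals the original row index, so we can look up the
# original ground-truth cluster of any synthetic record (not possible in a real cold start):
synthetic_to_original = restaurants['cluster_id']

def cleansed_label(idx_a: int, idx_b: int) -> int:
    # Illustrative label cleansing for a pair of synthetic records (by row position):
    orig_a = synthetic_to_original[synthetic.loc[idx_a, 'cluster_id']]
    orig_b = synthetic_to_original[synthetic.loc[idx_b, 'cluster_id']]
    return int(orig_a == orig_b)

# Records 0 and 5 stem from two different source records that are true duplicates,
# so the cleansed label is a match even though their synthetic cluster_ids differ:
print(cleansed_label(0, 5))

With labels cleansed this way, the corresponding pre-computed features are loaded and evaluated below: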

df_clean_synthetic = pd.read_parquet('clean_synthetic_similarity_features.parquet')
# Use model trained on original data and test on synthetic data:
y_clean_synthetic_pred = model.predict(df_clean_synthetic[model_features])
print('---\nPrecision on test set: ', precision_score(df_clean_synthetic['class'], y_clean_synthetic_pred))
print('Recall on test set: ', recall_score(df_clean_synthetic['class'], y_clean_synthetic_pred))
print('F1 on test set: ', f1_score(df_clean_synthetic['class'], y_clean_synthetic_pred))

Output:

---
Precision on test set: 0.9947255323305333
Recall on test set: 0.6771276595744681
F1 on test set: 0.8057599493630825

Our original model suffers only in recall, not precision, when switching from the original test set to the clean synthetic data. In other words, it is highly accurate on the examples it predicts as a match but cannot catch a third of all actual matches in the synthetic data.

Some investigative work on the test set reveals that the original data contains very few example duplicates caused by word swaps or significant variations in phone numbers. Our synthetic data covers those well by design, which explains the drop in recall and why the synthetic model considers phones much more important.

When synthetic training data is not enough#

With data augmentation, we can use prior knowledge to fight bottlenecks in our training data. However, as we have seen in our experiments, synthetic training data alone does not guarantee satisfactory performance. There are plenty of further opportunities:

  • Synthetic data generation is one of several exciting data-centric AI techniques. You can also look into programmatically detecting label errors or accelerating training with more weak supervision sources.

  • Some model families are, by design, unable to score high on similarity after word swaps. More clever feature engineering or switching to a deep learning model can help.

  • Pair-wise predictions of match vs. no-match are usually full of conflicts: we might predict a match for record pairs (A, B) and (B, C) but a no-match for (A, C). Resolving these conflicts, for example with graph clustering as sketched below, makes results practicable and can improve the overall resolution quality.
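For the last point, one common approach is to treat predicted matches as edges of a graph and assign every connected group of records a single entity ID. The sketch below uses a small, self-contained union-find; it illustrates the idea rather than the course’s actual implementation:

def resolve_entities(record_ids, predicted_matches):
    # Sketch: collapse pairwise match predictions into entity clusters via union-find.
    parent = {rid: rid for rid in record_ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x
    for a, b in predicted_matches:
        parent[find(a)] = find(b)
    return {rid: find(rid) for rid in record_ids}

# (A, B) and (B, C) are predicted matches, (A, C) is not -- transitivity puts all three together:
print(resolve_entities(['A', 'B', 'C', 'D'], [('A', 'B'), ('B', 'C')]))
# {'A': 'C', 'B': 'C', 'C': 'C', 'D': 'D'}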

Are you interested in learning more? Check out my Educative course below!

An Introduction to Entity Resolution in Python

A typical business stores data across multiple systems, including ERPs for operations, a CRM for marketing, files, notebooks, and BI apps for other purposes. Records of the same customer (entity) exist in multiple places, likely not in sync across nor unique within sources. This inconsistent situation generates an opportunity for us to drive business value by cross-referencing and deduplicating records with entity resolution. This course covers business acumen and hands-on coding. It starts with several business cases and a quick introduction to entity resolution in Python. Then, it explores semantic-preserving preprocessing, similarity feature engineering, graph clustering, weak supervision, confident learning, and integration. As a developer, you’ll increase your company’s business value by developing and deploying entity resolution pipelines. As a decision-maker, you’ll know which solution best suits your business cases and how to negotiate the best value for your money.

8hrs
Advanced
192 Playgrounds
7 Quizzes

  
