
How to solve cold start problems with synthetic data generation

Paul Kinsvater
May 09, 2024
7 min read

Training deduplication models on augmented data

Unresolved data can be a severe problem for a business—think of duplicative customer or product records within a system and nonexistent join keys across systems. Unresolved data can affect practically every business function:

  • Purchase: There is a risk of duplicative spending on software licenses due to a lack of transparency regarding what has already been bought.

  • Manufacturing: Supply chains can ship the wrong products to the wrong markets because manufacturing doesn’t know what’s in the sales pipeline.

  • Sales: Salespeople can approach the wrong customers with the wrong products because they don’t know what will be in stock.

  • Marketing: Campaigns can target the same customer multiple times using data from different channels.

  • Legal: GDPR and similar laws can require companies to respond promptly to a customer’s request to delete personal data, which is hard to guarantee when that customer’s records are scattered across duplicates.

Example: Deduplicating restaurant records

The restaurants dataset was provided by the DuDe team from the Hasso Plattner Institute, University of Potsdam (“DuDe.” n.d. Hpi.de. Accessed May 3, 2024. https://hpi.de/naumann/projects/data-integration-data-quality-and-data-cleansing/dude.html). Many thanks for this great contribution.

Let’s take the following dataset of restaurant records as an example:

import pandas as pd
restaurants = pd.read_parquet('dude_restaurants.parquet')
num_records = restaurants.shape[0]
num_clusters = restaurants['cluster_id'].nunique()
num_redundant = num_records - num_clusters
print(f'Total number of records: {num_records}\nNumber of clusters: {num_clusters}')
print(f'\nTotal number of pairs: {num_records * (num_records-1)//2}\nNumber of matching pairs: {num_redundant}')
print('\nFirst five:')
restaurants.head()

Output:

Total number of records: 864
Number of clusters: 752
Total number of pairs: 372816
Number of matching pairs: 112
First five:
| | customer_id | cluster_id | name | addr | city | phone | type |
|---:|---:|---:|:---|:---|:---|:---|:---|
| 0 | 1 | 0 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american |
| 1 | 2 | 0 | arnie morton's of chicago | 435 s. la cienega blvd. | los angeles | 310-246-1501 | steakhouses |
| 2 | 3 | 1 | art's delicatessen | 12224 ventura blvd. | studio city | 818/762-1221 | american |
| 3 | 4 | 1 | art's deli | 12224 ventura blvd. | studio city | 818-762-1221 | delis |
| 4 | 5 | 2 | hotel bel-air | 701 stone canyon rd. | bel air | 310/472-1211 | californian |

The restaurants dataset contains the ground truth encoded in the cluster_id column. All records within a cluster are pairwise duplicates. The data contains 112 matches among the 372816 possible pairs. Let’s assume we have already manually labeled a fraction of the pairs to use for training and treat the rest as a test set.
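To make this concrete, here is a hedged sketch of how such pairwise labels and similarity features could be derived from the records. The exact similarity measure behind the pre-computed scores is not shown in this post; the standard library’s difflib below is only an illustrative stand-in:

from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    # Simple character-based similarity in [0, 1]; the pre-computed
    # scores used later may rely on a different measure.
    return SequenceMatcher(None, a, b).ratio()

pair_rows = []
for (i, rec1), (j, rec2) in combinations(restaurants.iterrows(), 2):
    pair_rows.append({
        'pair': (rec1['customer_id'], rec2['customer_id']),
        'name_score': similarity(rec1['name'], rec2['name']),
        'street_score': similarity(rec1['addr'], rec2['addr']),
        'phone_score': similarity(rec1['phone'], rec2['phone']),
        # Ground truth: records within the same cluster are duplicates.
        'class': int(rec1['cluster_id'] == rec2['cluster_id']),
    })
pair_features = pd.DataFrame(pair_rows).set_index('pair')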

Below, we load pre-computed similarity scores and fit a binary classification model. We use the CatBoost family of binary classification algorithms, which allows us to add monotonicity constraints: intuitively, a higher similarity score should never decrease the predicted probability of a match.

  • We read the pre-computed similarity features, omitting the details of their computation for brevity.
  • We split the data into training and test sets to simulate a situation where we manually labeled half of the data.
  • We configure and fit a CatBoost classification algorithm to the training dataset.
  • We print the estimated feature importance scores to get an idea of the relevance of each feature.
  • We evaluate the model’s performance on the test set using precision, recall, and F1 as measures.
from catboost import CatBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
# We omit the details of how to compute similarity features and just load pre-computed scores:
df = pd.read_parquet('similarity_features.parquet')
print(df.head())
# Stratified sampling on the class column gives us the same imbalances in both subsets:
df_train, df_test = train_test_split(df, test_size=0.5, stratify=df['class'], random_state=1)
# Evaluate performance:
print('\n--- Fit Catboost with monotonicity constraints ---')
model_features = ['name_score', 'street_score', 'phone_score']
model = CatBoostClassifier(random_state=1, monotone_constraints=[1, 1, 1])
model.fit(X=df_train[model_features], y=df_train['class'], verbose=False)
print('\nRelative feature importance:')
print(model.get_feature_importance(prettified=True))
y_test_pred = model.predict(df_test[model_features])
print('---\nPrecision on test set: ', precision_score(df_test['class'], y_test_pred))
print('Recall on test set: ', recall_score(df_test['class'], y_test_pred))
print('F1 on test set: ', f1_score(df_test['class'], y_test_pred))

Output:

| | name_score | street_score | phone_score | class |
|:-------|-------------:|---------------:|--------------:|--------:|
| (2, 1) | 1 | 0.990476 | 1 | 1 |
| (3, 1) | 0.568301 | 0.575926 | 0.3 | 0 |
| (3, 2) | 0.568301 | 0.610582 | 0.3 | 0 |
| (4, 1) | 0.594577 | 0.575926 | 0.3 | 0 |
| (4, 2) | 0.594577 | 0.610582 | 0.3 | 0 |
--- Fit Catboost with monotonicity constraints ---
Relative feature importance:
Feature Id Importances
0 name_score 36.647693
1 street_score 32.938509
2 phone_score 30.413798
---
Precision on test set: 0.9811320754716981
Recall on test set: 0.9285714285714286
F1 on test set: 0.9541284403669725

The test performance makes a strong impression. A precision of 98% means that 98 of every 100 predicted matches are also actual matches, and a recall of 93% means that we only missed 7 of every 100 actual matches. The F1 score is the harmonic mean of both.
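As a quick sanity check, we can reproduce the reported F1 score from precision and recall:

precision, recall = 0.9811320754716981, 0.9285714285714286
f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.9541284403669725, matching the reported score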

There are, however, two concerns with what we did:

  1. Assuming that half of all pairs are available for training is unrealistic. Most projects start without a single label, and labeling (or any other way of supervision) is a big part of the challenge.
  2. Even if we start with labeled data, how can we know that the available data represents unseen examples well? Our customer base might grow, and so might the number of dirty records.

We illustrate how to tackle both points with synthetic data.

Fighting cold starts with synthetic duplicates

Let’s assume that the cluster_id column is left entirely blank. We call this a cold start. Let’s further assume that we ran out of budget; proper manual labeling can be a costly exercise.

The trick is that manual labels are only one of our supervision sources. Call it domain knowledge, subject matter expertise, or common sense: a typo, word swap, or many other slight variations of a restaurant record still describe the same restaurant entity. In classification, we call this a class-invariant transformation.

In the following code playground, we use the nlpaug package to augment a restaurant name with a few class-invariant transformations:

  • We delete up to two random characters in a text.
  • We swap adjacent characters in up to two words in a text.
  • We swap, at most, one pair of adjacent words in the text.
  • We orchestrate several individual augmenters by randomly applying a subset of the steps.
import random
import numpy as np
import nlpaug.augmenter.char as nac # Character-level transformations
import nlpaug.augmenter.word as naw # Word-level transformations
import nlpaug.flow as naf # Orchestrate different transformations
random.seed(1)
np.random.seed(1)
customer_name = 'arnie mortons of chicago'
num_aug = 3 # Number of trials
print('Delete at most two random characters:')
augmenter_delete_char = nac.RandomCharAug(action='delete', aug_char_max=2)
print(augmenter_delete_char.augment(customer_name, n=num_aug))
print('\nSwap one pair of adjacent characters per word in max two words:')
augmenter_swap_char = nac.RandomCharAug(action='swap', aug_char_max=1, aug_word_max=2)
print(augmenter_swap_char.augment(customer_name, n=num_aug))
print('\nSwap one pair of adjacent words:')
augmenter_swap_word = naw.RandomWordAug(action='swap', aug_max=1)
print(augmenter_swap_word.augment(customer_name, n=num_aug))
print('\nOrchestrate two or more transformations in a "flow":')
name_augmenter = naf.Sometimes([
    augmenter_delete_char,
    augmenter_swap_char,
    augmenter_swap_word
], aug_p=0.3)  # The chance of execution for every step
print(name_augmenter.augment(customer_name, n=num_aug))

Output:

Delete at most two random characters:
['rie mortons of chiag', 'arnie mrton of hiago', 'rne moros of chicago']
Swap one pair of adjacent characters per word in max two words:
['arnei mrotons of chicago', 'arine mortons of chicgao', 'arine omrtons of chicago']
Swap one pair of adjacent words:
['arnie of mortons chicago', 'arnie mortons chicago of', 'mortons arnie of chicago']
Orchestrate two or more transformations in a "flow":
['arnie of mortons chicago', 'arnei morotns chicago of', 'arnie ortns of ciago']

The Sometimes augmenter orchestrates individual steps and executes a random subset. This way, we can cover diverse transformations with just a single configuration. For streets and phones, we apply different pipelines in the following code.

  • We replace up to three tokens in a text using preconfigured aliases.
  • We substitute up to ten characters of the phone number with one of the provided digit candidates.
random.seed(1)
np.random.seed(1)
street = '435 s la cienega blv'
num_aug = 3 # Number of trials
# We have plenty of options, including hard-coded alias replacements:
reserved_tokens = [
    ['n', 'north'],
    ['e', 'east'],
    ['s', 'south'],
    ['w', 'west'],
    ['blv', 'blvd', 'boulevard'],
    ['st', 'street'],
    ['sts', 'streets'],
    ['ave', 'avenue'],
    ['rd', 'road']
]
reserved_aug = naw.ReservedAug(reserved_tokens=reserved_tokens, aug_max=3)
street_augmenter = naf.Sometimes([
    reserved_aug,
    nac.RandomCharAug(action='swap', aug_char_max=1, aug_word_max=2),
], aug_p=0.5)
print('Augment streets:')
print('Original: ', street)
print(street_augmenter.augment(street, n=num_aug))
phone = '3102461501'
phone_augmenter = naf.Sometimes([
    nac.RandomCharAug(action='swap', aug_char_max=3, aug_word_max=1),
    nac.RandomCharAug(action='substitute', aug_char_max=10, candidates=list('0123456789'))
], aug_p=0.5)
print('\nAugment phones:')
print('Original: ', phone)
print(phone_augmenter.augment(phone, n=num_aug))

Output:

Augment streets:
Original: 435 s la cienega blv
['435 south la cienega boulevard', '435 sotuh la cienega bouleavrd', '435 osuth la cienega lbvd']
Augment phones:
Original: 3102461501
['1320665801', '1320641501', '3102461150']

Let’s put those individual augmentation functions to work by orchestrating them and systematically applying them to all our restaurant records.

  • The augment_record function produces duplicates for a single original record.
  • The create_synthetic_dataset function systematically applies augment_record on every original record.
  • In the last nine lines, we configure augmentation and apply it to all original records to produce synthetic clusters of size 5.
from functools import reduce
from nlpaug.base_augmenter import Augmenter

def augment_record(record: pd.Series, config: dict[str, Augmenter], n_per_attribute: int,
                   random_seed: int = None) -> pd.DataFrame:
    """Augment a single record across multiple attributes, creating a table of several duplicates.

    Starts from the original `record` and applies augmenters on every attribute configured in the `config`.
    We return the cartesian product of the original and augmented attributes across all configured attributes.
    We also extend the output by exact copies of every attribute in `record` that is not in the `config`.
    """
    if random_seed:
        # nlpaug requires seeding both to make it deterministic:
        random.seed(random_seed)
        np.random.seed(random_seed)
    augmented_attributes = dict()
    for attribute, augmenter in config.items():
        # Apply the configured augmenters:
        augmented_attributes[attribute] = augmenter.augment(record[attribute], n=n_per_attribute)
    # Create one single-column DataFrame per augmented attribute:
    dfs = [pd.DataFrame({attribute: [record[attribute]] + records})
           for attribute, records in augmented_attributes.items()]
    # Create the cartesian product across all augmented attribute series:
    res = reduce(lambda df1, df2: pd.merge(df1, df2, how='cross'), dfs)
    # Add the same attribute values for any attributes not in the config:
    for col in record.index:
        if col not in res.columns:
            res[col] = record[col]
    # Return the DataFrame with columns in the same order as in the original record:
    return res[record.index]

def create_synthetic_dataset(df: pd.DataFrame, config: dict[str, Augmenter], cluster_size: int,
                             random_seed: int = None) -> pd.DataFrame:
    """Augment every record in `df` with the same augmentation `config`.

    Returns a table of record clusters with approximately `cluster_size` records per cluster.
    """
    augmented_dfs = []
    # Generate more variants per attribute than strictly needed; we sample down to `cluster_size` below:
    n_per_attribute = round(cluster_size ** (1 / len(config))) + 10
    for original_idx, row in df.iterrows():
        try:
            duplicates = augment_record(record=row, config=config, n_per_attribute=n_per_attribute,
                                        random_seed=random_seed)
            duplicates['cluster_id'] = original_idx
            # Keep the original and a random sample of the rest to meet the cluster size:
            duplicates = pd.concat([duplicates.head(1), duplicates.tail(-1).drop_duplicates().sample(cluster_size - 1)])
        except TypeError:
            # nlpaug does not behave as expected in some cases, so not enough duplicates are created.
            # Let's just return the original record in those cases:
            duplicates = pd.DataFrame(row).T
            duplicates['cluster_id'] = original_idx
        augmented_dfs.append(duplicates)
    return pd.concat(augmented_dfs, ignore_index=True)

augment_config = {
    'name': name_augmenter,
    'addr': street_augmenter,
    'phone': phone_augmenter
}
synthetic = create_synthetic_dataset(df=restaurants.drop(['cluster_id', 'customer_id'], axis=1), config=augment_config, cluster_size=5, random_seed=1)
print(f'Total number of records: {synthetic.shape[0]}\nAverage cluster size: {synthetic.cluster_id.value_counts().mean()}')
print('\nFirst cluster:')
synthetic.head()

Output:

Total number of records: 4320
Average cluster size: 5.0
First cluster:
| | name | addr | city | phone | type | cluster_id |
|---:|:----------------------------|:---------------------------------|:------------|:-----------------|:---------|-------------:|
| 0 | arnie morton's of chicago | 435 s. la cienega blv. | los angeles | 310/246-1501 | american | 0 |
| 1 | rni mortno ' s of chioa | 435 s. la icenega blv. | los angeles | 310 / 246 - 5401 | american | 0 |
| 2 | arn ortn ' s of chicago | 435 s. la icenega blv. | los angeles | 310 / 246 - 5113 | american | 0 |
| 3 | arnie morton ' s chicago of | 435 south. la icenega boulveard. | los angeles | 310 / 246 - 8015 | american | 0 |
| 4 | aie omrton ' s of cicoa | 435 sotuh. la ciengea boulevard. | los angeles | 310 / 246 - 1511 | american | 0 |

The synthetic examples of the first cluster look reasonably close, so we can consider them duplicates of the same restaurant.

Performance evaluation

We pre-computed similarity features for all pairs in the synthetic data. Any two synthetic records with the same cluster_id are labeled with class=1 (match) and all others with class=0 (no match).
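For illustration, here is a hedged sketch of how those labels could be constructed; in practice, a blocking strategy would avoid enumerating all roughly 9.3 million pairs:

from itertools import combinations

# Two synthetic records match exactly when they share a cluster_id:
cluster_ids = synthetic['cluster_id']
labels = {(i, j): int(cluster_ids[i] == cluster_ids[j])
          for i, j in combinations(synthetic.index, 2)}

Below, we train a model on just the synthetic examples and evaluate the performance on the original test set.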

# Here, we also load pre-computed similarity features for our synthetic data:
df_synthetic = pd.read_parquet('synthetic_similarity_features.parquet')
# Evaluate performance:
print('\n--- Fit Catboost to synthetic data ---')
model_features = ['name_score', 'street_score', 'phone_score']
model_synthetic = CatBoostClassifier(random_state=1, monotone_constraints=[1, 1, 1])
model_synthetic.fit(X=df_synthetic[model_features], y=df_synthetic['class'], verbose=False)
print('\nRelative feature importance:')
print(model_synthetic.get_feature_importance(prettified=True))
y_test_pred = model_synthetic.predict(df_test[model_features])
print('---\nPrecision on test set: ', precision_score(df_test['class'], y_test_pred))
print('Recall on test set: ', recall_score(df_test['class'], y_test_pred))
print('F1 on test set: ', f1_score(df_test['class'], y_test_pred))

Output:

--- Fit Catboost to synthetic data ---
Relative feature importance:
Feature Id Importances
0 phone_score 51.308700
1 name_score 28.436685
2 street_score 20.254615
---
Precision on test set: 0.7903225806451613
Recall on test set: 0.875
F1 on test set: 0.8305084745762712

Our performance is worse, but overall, it’s not bad, considering that we did not use a single hand-labeled example. Note also that the model’s feature importance is drastically different from the one fitted to the original training data. Phones are a lot more important now.

Let’s also flip sides by using the model fitted to the original training data and evaluating its performance on synthetic data:

df_synthetic = pd.read_parquet('synthetic_similarity_features.parquet')
# Use model trained on original data and test on synthetic data:
y_synthetic_pred = model.predict(df_synthetic[model_features])
print('---\nPrecision on synthetic data: ', precision_score(df_synthetic['class'], y_synthetic_pred))
print('Recall on synthetic data: ', recall_score(df_synthetic['class'], y_synthetic_pred))
print('F1 on synthetic data: ', f1_score(df_synthetic['class'], y_synthetic_pred))

Output:

---
Precision on synthetic data: 0.7596377749029755
Recall on synthetic data: 0.6796296296296296
F1 on synthetic data: 0.7174098961514966

That’s a strong decrease compared to the performance on the original test data. We can explain this in two ways. First, note that our synthetic data generation has one significant flaw: if two original records are duplicates, every combination of their synthetic variations must also be a match.

We did not consider this since we did not know which original records were duplicates. As a consequence, some of our no-match labels are wrong. Let’s repeat the last evaluation, but this time on cleansed synthetic labels (something we cannot do easily in practice):

df_clean_synthetic = pd.read_parquet('clean_synthetic_similarity_features.parquet')
# Use model trained on original data and test on synthetic data:
y_clean_synthetic_pred = model.predict(df_clean_synthetic[model_features])
print('---\nPrecision on clean synthetic data: ', precision_score(df_clean_synthetic['class'], y_clean_synthetic_pred))
print('Recall on clean synthetic data: ', recall_score(df_clean_synthetic['class'], y_clean_synthetic_pred))
print('F1 on clean synthetic data: ', f1_score(df_clean_synthetic['class'], y_clean_synthetic_pred))

Output:

---
Precision on clean synthetic data: 0.9947255323305333
Recall on clean synthetic data: 0.6771276595744681
F1 on clean synthetic data: 0.8057599493630825
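For reference, here is a hedged sketch of how such cleansed labels could be derived when the original ground truth is available (as in our experiment, though not in a true cold start):

from itertools import combinations

# Map each synthetic record back to the true entity of its originating
# record: synthetic['cluster_id'] holds the index of the original record,
# which we look up in the original ground truth.
true_entity = synthetic['cluster_id'].map(restaurants['cluster_id'])
# Two synthetic records now also match if their originating records are
# duplicates of each other, fixing the wrong no-match labels:
clean_labels = {(i, j): int(true_entity[i] == true_entity[j])
                for i, j in combinations(synthetic.index, 2)}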

Our original model only suffers in recall but not precision when switching from the original test set to the clean synthetic data. In other words, it is highly accurate on the examples it predicts as a match but cannot catch a third of all actual matches in the synthetic data.

Some investigative work on the test set reveals that the original data contains very few example duplicates caused by word swaps or significant variations in phone numbers. Our synthetic data covers those well by design, which explains the drop in recall and why the synthetic model considers phones much more important.

When synthetic training data is not enough

With data augmentation, we can use prior knowledge to fight bottlenecks in our training data. However, as we have seen in our experiments, synthetic training data alone does not guarantee satisfactory performance. There are plenty more opportunities:

  • Synthetic data generation is one of several exciting data-centric AI techniques. You can also look into programmatically detecting label errors or accelerating training with more weak supervision sources.

  • Some model families are, by design, unable to score high on similarity after word swaps. More clever feature engineering or switching to a deep learning model can help.

  • Pair-wise predictions of match vs. no-match are usually full of conflicts: we predict a match for record pairs (A, B) and (B, C) but a no-match for (A, C). Resolving these conflicts, e.g., with the transitive closure sketched below, makes results practicable and can improve the overall resolution quality.
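For the last point, a common approach is to take the transitive closure of all predicted matches, for example, with a small union-find structure. Here is a hedged sketch; the function name and inputs are illustrative:

def resolve_conflicts(matches):
    """Group records into entities via the transitive closure of predicted matches (union-find)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # Path compression
            x = parent[x]
        return x
    for a, b in matches:
        parent[find(a)] = find(b)
    return {record: find(record) for record in parent}

# (A, B) and (B, C) are predicted matches, while (A, C) is not. The
# transitive closure resolves the conflict into one entity cluster:
print(resolve_conflicts([('A', 'B'), ('B', 'C')]))  # {'A': 'C', 'B': 'C', 'C': 'C'}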

Are you interested in learning more? Check out my Educative course below!

An Introduction to Entity Resolution in Python


A typical business stores data across multiple systems, including ERPs for operations, a CRM for marketing, files, notebooks, and BI apps for other purposes. Records of the same customer (entity) exist in multiple places, likely not in sync across nor unique within sources. This inconsistent situation generates an opportunity for us to drive business value by cross-referencing and deduplicating records with entity resolution. This course covers business acumen and hands-on coding. It starts with several business cases and a quick introduction to entity resolution in Python. Then, it explores semantic-preserving preprocessing, similarity feature engineering, graph clustering, weak supervision, confident learning, and integration. As a developer, you’ll increase your company’s business value by developing and deploying entity resolution pipelines. As a decision-maker, you’ll know which solution best suits your business cases and how to negotiate the best value for your money.

8hrs
Advanced
192 Playgrounds
7 Quizzes