Programmatic Labeling

Understand how to complement hand labeling with other sources of supervision to build useful training datasets at low cost.

Entity resolution involves predicting whether relevant pairs of records match, essentially a binary classification problem. Two approaches for creating a classification model dominate the literature: rule-based and learning-based.

In the learning-based approach, the bottleneck for prediction quality is now the training data rather than algorithms, modeling, or computing power. Hand-labeling data, the traditional approach, does not scale well. Programmatic labeling addresses this bottleneck by complementing hand labeling with any other source of supervision, including rule-based labeling.

Rule-based vs. learning-based

By rule-based, we mean heuristics formulated by subject matter experts. For example: “two customer records match if their names score above a threshold of 0.9 with Jaro-Winkler similarity and their postcode is the same or their billing address scores at least 0.8 by geodesic similarity.” Typically, we start with sound rules, test them on examples, and adjust them by hand, for instance by tuning the thresholds.
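To make the example rule concrete, here is a minimal sketch in plain Python. It assumes the similarity scores have already been computed elsewhere, and it reads the rule as "name AND (postcode OR address)", which is one possible interpretation of the wording. The function name, parameter names, and sample values are hypothetical.

```python
def rule_based_match(
    name_sim: float,
    same_postcode: bool,
    address_sim: float,
    name_threshold: float = 0.9,
    address_threshold: float = 0.8,
) -> bool:
    """Apply the example rule to one pair of customer records.

    Assumed reading: the names must score above `name_threshold`, AND
    either the postcodes are equal OR the address similarity reaches
    at least `address_threshold`.
    """
    return name_sim > name_threshold and (
        same_postcode or address_sim >= address_threshold
    )


# Hypothetical, precomputed scores for three record pairs.
print(rule_based_match(0.95, True, 0.10))   # strong name, same postcode
print(rule_based_match(0.95, False, 0.85))  # strong name, similar address
print(rule_based_match(0.95, False, 0.50))  # strong name, nothing else agrees
```

Exposing the thresholds as parameters makes the manual tuning step explicit: adjusting a rule amounts to calling the function with different values and checking the results against known examples.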

Learning-based means we give a learning algorithm match and no-match examples so that it can select the model within a chosen model family that best fits the data. For example, we sample 100 pairs of customer records, compute their name and address similarity scores, and hand-label each pair as a match or a no-match using our own judgment. Next, we train a decision tree on this labeled data and apply it to all unlabeled pairs.
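The learning step can be illustrated with the simplest possible decision tree, a one-split "decision stump", written in plain Python so that nothing beyond the standard library is needed. This is a sketch of the idea rather than the lesson's actual model: the six labeled pairs are made-up toy values, and `fit_stump` and `predict` are hypothetical helper names.

```python
def fit_stump(rows, labels):
    """Pick the (feature index, threshold) split with the fewest training errors.

    The stump predicts a match (1) when row[feature] >= threshold.
    """
    best = None
    for feature in range(len(rows[0])):
        # Candidate thresholds are the observed values of this feature.
        for threshold in sorted({row[feature] for row in rows}):
            errors = sum(
                (row[feature] >= threshold) != bool(label)
                for row, label in zip(rows, labels)
            )
            if best is None or errors < best[0]:
                best = (errors, feature, threshold)
    _, feature, threshold = best
    return feature, threshold


def predict(stump, row):
    """Label one pair with a fitted stump: 1 = match, 0 = no-match."""
    feature, threshold = stump
    return int(row[feature] >= threshold)


# Toy hand-labeled pairs: (name similarity, address similarity).
rows = [(0.97, 0.90), (0.93, 0.40), (0.91, 0.85),
        (0.60, 0.88), (0.45, 0.20), (0.72, 0.35)]
labels = [1, 1, 1, 0, 0, 0]

stump = fit_stump(rows, labels)
print(stump)                          # learned (feature index, threshold)
print(predict(stump, (0.95, 0.30)))   # apply to an unlabeled pair
```

Unlike the rule-based version, the threshold here is not chosen by an expert but derived from the labeled examples. A full decision tree repeats this split search recursively on each resulting subset.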

This lesson is about combining both in a unified framework with the Snorkel package. We demonstrate the concepts using the restaurants dataset.
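Before turning to Snorkel itself, the core idea of combining sources can be sketched in plain Python. The code below is a stand-in and does not use the Snorkel API: each source of supervision is a small "labeling function" that votes match, no-match, or abstains, and the votes are combined by simple majority. All function names, field names, and thresholds are hypothetical.

```python
MATCH, NO_MATCH, ABSTAIN = 1, 0, -1


def lf_name_similar(pair):
    # Expert rule: very similar names suggest a match.
    return MATCH if pair["name_sim"] > 0.9 else ABSTAIN


def lf_name_dissimilar(pair):
    # Expert rule: very different names suggest a no-match.
    return NO_MATCH if pair["name_sim"] < 0.5 else ABSTAIN


def lf_same_postcode(pair):
    # Expert rule: identical postcodes suggest a match.
    return MATCH if pair["same_postcode"] else ABSTAIN


def lf_hand_label(pair):
    # A hand label, where one exists, is treated as one more source.
    return pair.get("hand_label", ABSTAIN)


LABELING_FUNCTIONS = [
    lf_name_similar, lf_name_dissimilar, lf_same_postcode, lf_hand_label,
]


def majority_vote(pair):
    """Combine all non-abstaining votes; abstain on ties or when no source votes."""
    votes = [lf(pair) for lf in LABELING_FUNCTIONS]
    n_match = votes.count(MATCH)
    n_no_match = votes.count(NO_MATCH)
    if n_match == n_no_match:
        return ABSTAIN
    return MATCH if n_match > n_no_match else NO_MATCH


print(majority_vote({"name_sim": 0.95, "same_postcode": True}))
print(majority_vote({"name_sim": 0.30, "same_postcode": True}))
print(majority_vote({"name_sim": 0.30, "same_postcode": True,
                     "hand_label": NO_MATCH}))
```

Majority voting treats every source as equally reliable, which is rarely true in practice. A framework such as Snorkel goes further by estimating how accurate each source is and weighting its votes accordingly.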
