Fighting Label Errors

Learn how to detect and treat label errors with confident learning techniques.

The real world is full of imperfect data. If we ignore data quality issues, we might draw wrong conclusions and make suboptimal decisions. This course already tackles one such issue: duplicate records. However, the outcome of entity resolution itself depends on the quality of the data it is built on.

This lesson introduces confident learning, a robust alternative to standard (or naive) machine learning. In confident learning, potential data errors are part of the modeling, so algorithms can automatically adapt to imperfect data. For example, can we trust that the labels we use for the initial training of our machine learning model are 100% accurate?
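To make the idea concrete, here is a minimal sketch of the core confident learning check, written with NumPy only. It assumes we already have out-of-sample predicted probabilities for every labeled example (for instance, from cross-validation). The function name `find_suspect_labels`, the class encoding (0 for no-match, 1 for match), and all numbers are illustrative assumptions, not data or code from this course.

```python
import numpy as np


def find_suspect_labels(labels, pred_probs):
    """Return indices whose given label disagrees with a confidently predicted class."""
    labels = np.asarray(labels)
    pred_probs = np.asarray(pred_probs, dtype=float)
    n_classes = pred_probs.shape[1]

    # Per-class threshold: the average predicted probability of class k
    # over the examples that currently carry label k.
    thresholds = np.array(
        [pred_probs[labels == k, k].mean() for k in range(n_classes)]
    )

    # A class counts as "confident" for an example when its predicted
    # probability reaches that class's threshold.
    confident = pred_probs >= thresholds

    # Among the confident classes, keep the most probable one (-1 if none).
    masked = np.where(confident, pred_probs, -np.inf)
    confident_class = np.where(confident.any(axis=1), masked.argmax(axis=1), -1)

    # Suspect: a confident class exists and it differs from the given label.
    suspects = (confident_class != -1) & (confident_class != labels)
    return np.flatnonzero(suspects), thresholds


# Made-up example: 0 = no-match, 1 = match.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled no-match, but the model is confident it is a match
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]

suspects, thresholds = find_suspect_labels(labels, pred_probs)
print("suspect indices:", suspects.tolist())
print("per-class thresholds:", thresholds.round(2).tolist())
```

Only the third example is flagged: its given label is no-match, yet its match probability clears the match threshold. Flagged examples are candidates for review, not proven errors.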

Detect label errors

Machine learning algorithms require some labeled examples for initial training. In entity resolution, we select a subset of pairs and assign them to the match or no-match class. Large-scale applications, such as master data management in the enterprise, involve several users reviewing pairs of records. Every such manual intervention is a potential error source.

Below, we read the restaurants dataset, create all record pairs, compute four similarity features for each, and add the human-made class labels:
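As a rough sketch of these steps, the snippet below uses three made-up records in place of the restaurants dataset. The field names (`name`, `address`, `city`, `cuisine`), the `difflib`-based similarity measure, and the labels in `human_labels` are all assumptions chosen for illustration; the actual dataset, features, and labels used in this lesson may differ.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical stand-in for the restaurants dataset (made-up records).
records = {
    0: {"name": "blue door cafe", "address": "12 main st",
        "city": "springfield", "cuisine": "american"},
    1: {"name": "the blue door cafe", "address": "12 main street",
        "city": "springfield", "cuisine": "american"},
    2: {"name": "golden wok", "address": "98 river rd",
        "city": "shelbyville", "cuisine": "chinese"},
}

# Four attributes, one similarity feature each (assumed field names).
FIELDS = ("name", "address", "city", "cuisine")


def similarity(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a, b).ratio()


# Create all unordered record pairs and compute the four features.
pairs = []
for (id_a, rec_a), (id_b, rec_b) in combinations(records.items(), 2):
    row = {"pair": (id_a, id_b)}
    for field in FIELDS:
        row[f"{field}_sim"] = similarity(rec_a[field], rec_b[field])
    pairs.append(row)

# Hypothetical human-made labels: 1 = match, 0 = no-match.
human_labels = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
for row in pairs:
    row["label"] = human_labels[row["pair"]]

for row in pairs:
    print(row)
```

Each row now holds one record pair, its four similarity scores, and the label a reviewer assigned, which is the shape of input a match/no-match classifier expects.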
