About the Course
Understand the purpose of entity resolution, target audience, and prerequisites.
Businesses use enterprise resource planning (ERP) systems for operations, customer relationship management (CRM) systems for marketing, files, external application programming interfaces (APIs), and other data sources. Too often, records of the same customer are scattered across those sources without shared join keys, with variations in names and language, and even with duplicates within a single source. The same problem affects suppliers, people, locations, and transactional data.
This course teaches you how to resolve these issues for tabular data.
Cross-reference tables
Learners might ask what “resolving records” means in concrete terms. It is about building cross-reference tables. Here’s an example:
ERP Customers
ID | CompanyName | AddressLine1 | AddressLine2 | Country |
0 | ABC Corp. | 123 Oak Ave | Montreal QC | CA |
1 | Contoso | Rue de Bell. | 75374 Paris | FR |
2 | DB Schenker | Kruppstr. 4 | 45128 Essen | DE |
3 | Big Kahuna | 99 Ham Drive | Miami 33185 | US |
4 | Schenker AG | Krupp Straße 4 | Essen-City | DE |
... | ... | ... | ... | ... |
CRM Accounts
AccID | Name | Street | ZipCode | City | Country |
a3b6l | Entity Hero | Bolzweg | 47839 | Krefeld | DE |
ak44j | Lucha Libre Ole | C. de Benova | 66667 | Palencia | ES |
a89ci | Deutsche Bahn | Kruppstr. 4 | 45128 | Essen | DE |
aa341 | Contonso Inc. | P.O. Box 123 | 75374 | Paris | FR |
a31bc | Bambini Pub | 66 Bobcat Lane | 09876 | London | UK |
... | ... | ... | ... | ... | ... |
We have customer records from two systems, with duplicates within each and no matching keys across the tables. The first table comes from our ERP, which handles orders, invoices, and other financial transactions. The second comes from our CRM, which is used for marketing campaigns and captures calls, visits, and other customer interactions. The marketing team wants us to analyze how effective customer visits are at generating new orders.
We can fill the integration gap between ERP and CRM with entity resolution. The outcome is the following cross-reference table:
Cross-References
source | original_id | resolved_id |
erp | 1 | c-0 |
crm | aa341 | c-0 |
erp | 2 | c-1 |
erp | 4 | c-1 |
crm | a89ci | c-1 |
... | ... | ... |
This table tells us which customer records belong to which customer entity. It bridges the original keys, allowing us to join and aggregate data within and across sources—a first step toward analyzing the effectiveness of our marketing campaigns.
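To make the bridging role concrete, here is a minimal pandas sketch. The cross-reference rows are copied from the table above; the `erp_orders` and `crm_visits` tables, their column names, and all their values are hypothetical and invented purely for illustration, as is the helper `attach_resolved_id`.

```python
import pandas as pd

# Cross-reference rows copied from the example table.
# IDs are stored as strings so ERP and CRM keys share one dtype.
xref = pd.DataFrame({
    "source": ["erp", "crm", "erp", "erp", "crm"],
    "original_id": ["1", "aa341", "2", "4", "a89ci"],
    "resolved_id": ["c-0", "c-0", "c-1", "c-1", "c-1"],
})

# Hypothetical transactional data (not from the course).
erp_orders = pd.DataFrame({
    "customer_id": ["1", "2", "4", "4"],
    "order_value": [100.0, 250.0, 50.0, 75.0],
})
crm_visits = pd.DataFrame({
    "account_id": ["aa341", "a89ci", "a89ci"],
    "visit_date": ["2023-01-10", "2023-02-01", "2023-03-15"],
})


def attach_resolved_id(df, source, key):
    """Map a source-specific key column to the shared resolved_id."""
    lookup = xref.loc[xref["source"] == source, ["original_id", "resolved_id"]]
    merged = df.merge(lookup, left_on=key, right_on="original_id", how="left")
    return merged.drop(columns="original_id")


orders = attach_resolved_id(erp_orders, "erp", "customer_id")
visits = attach_resolved_id(crm_visits, "crm", "account_id")

# Aggregate both sources per resolved customer entity.
summary = pd.concat(
    [
        orders.groupby("resolved_id")["order_value"].sum().rename("total_order_value"),
        visits.groupby("resolved_id").size().rename("visit_count"),
    ],
    axis=1,
)
print(summary)
```

Once both sources carry `resolved_id`, orders placed under ERP IDs 2 and 4 and visits logged under CRM account a89ci all roll up to the same entity, c-1.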
Target audience
Entity resolution belongs in every data quality toolbox. As individual contributors (data scientists, engineers, developers), we’ll learn to conduct entity resolution end-to-end in Python. Entity resolution is an active research field in computer science and statistics, a challenging skill to master, and a sought-after proficiency in companies facing data quality issues, which means practically everywhere.
We’ll become familiar with the technology stack around entity resolution. What if it makes more sense to buy, not build, many or all of the components? This course prepares us before we talk to vendors, so that they don’t sell us things we don’t need and we can challenge them to get the best value for our money.
Prerequisites
Entity resolution sits at the intersection of computer science and statistics. We’ll use text processing, machine learning, graphs, and more. Learners are in a good position if they have some experience in the following areas:
- Coding in Python
- Manipulating DataFrames with pandas
- Building a binary classification model with scikit-learn
- Manipulating graphs with NetworkX
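As a hint of where graphs come in, the sketch below assumes that pairs of matching records have already been identified (for example, by a binary classifier) and groups them into entities with connected components. The `matches` list is hypothetical, chosen to agree with the cross-reference example earlier in this lesson, and this is one common approach rather than necessarily the exact method the course uses.

```python
import networkx as nx

# Hypothetical matched pairs; each record is a (source, original_id) tuple.
matches = [
    (("erp", "1"), ("crm", "aa341")),
    (("erp", "2"), ("erp", "4")),
    (("erp", "2"), ("crm", "a89ci")),
]

graph = nx.Graph()
graph.add_edges_from(matches)

# Remember insertion order so the resolved IDs are deterministic.
order = {node: i for i, node in enumerate(graph.nodes)}

# Each connected component is treated as one resolved entity.
components = sorted(
    nx.connected_components(graph),
    key=lambda comp: min(order[n] for n in comp),
)

xref_rows = []
for i, comp in enumerate(components):
    for source, original_id in sorted(comp, key=order.get):
        xref_rows.append((source, original_id, f"c-{i}"))

for row in xref_rows:
    print(row)
```

Records 2 and 4 in the ERP and a89ci in the CRM end up in the same component because they are linked through record 2, even though no direct match between 4 and a89ci was supplied.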
Attribution and glossary
This course would be nothing without many open-source packages and datasets. Please check out the Glossary in the “Appendix” section of this course for attribution and a collection of useful links.