About the Course

Understand the purpose of entity resolution, the target audience of this course, and its prerequisites.

Businesses use enterprise resource planning (ERP) systems for operations, customer relationship management (CRM) systems for marketing, files, external application programming interfaces (APIs), and other data sources. Too often, records of the same customer hide across those sources, not formally integrated by join keys, with variations in names and language and even duplicates within each source. This happens with customers, suppliers, people, locations, and transactional data.

This course teaches you how to resolve these issues for tabular data.

Cross-reference tables

Learners might ask what “resolving records” means in concrete terms. It is about building cross-reference tables. Here’s an example:

ERP Customers

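| ID  | CompanyName | AddressLine1   | AddressLine2 | Country |
| --- | ----------- | -------------- | ------------ | ------- |
| 0   | ABC Corp.   | 123 Oak Ave    | Montreal QC  | CA      |
| 1   | Contoso     | Rue de Bell.   | 75374 Paris  | FR      |
| 2   | DB Schenker | Kruppstr. 4    | 45128 Essen  | DE      |
| 3   | Big Kahuna  | 99 Ham Drive   | Miami 33185  | US      |
| 4   | Schenker AG | Krupp Straße 4 | Essen-City   | DE      |
| ... | ...         | ...            | ...          | ...     |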

CRM Accounts

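| AccID | Name            | Street         | ZipCode | City     | Country |
| ----- | --------------- | -------------- | ------- | -------- | ------- |
| a3b6l | Entity Hero     | Bolzweg        | 47839   | Krefeld  | DE      |
| ak44j | Lucha Libre Ole | C. de Benova   | 66667   | Palencia | ES      |
| a89ci | Deutsche Bahn   | Kruppstr. 4    | 45128   | Essen    | DE      |
| aa341 | Contonso Inc.   | P.O. Box 123   | 75374   | Paris    | FR      |
| a31bc | Bambini Pub     | 66 Bobcat Lane | 09876   | London   | UK      |
| ...   | ...             | ...            | ...     | ...      | ...     |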

We have customer records from two systems, with duplicates within each and no matching keys across the tables. The first table comes from our ERP, which handles orders, invoices, and other financial transactions. The second comes from our CRM, which we use to manage marketing campaigns and to capture calls, visits, and other customer interactions. The marketing team wants us to analyze how effective customer visits are at generating new orders.

We can fill the integration gap between ERP and CRM with entity resolution. The outcome is the following cross-reference table:

Cross-References

| source | original_id | resolved_id |
| ------ | ----------- | ----------- |
| erp    | 1           | c-0         |
| crm    | aa341       | c-0         |
| erp    | 2           | c-1         |
| erp    | 4           | c-1         |
| crm    | a89ci       | c-1         |
| ...    | ...         | ...         |

This table tells us which customer records belong to which customer entity. It bridges the original keys, allowing us to join and aggregate data within and across sources—a first step toward analyzing the effectiveness of our marketing campaigns.
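
To make the bridging role of the cross-reference table concrete, here is a minimal sketch in pandas. The cross-reference rows are copied from the table above. The `orders` and `visits` DataFrames, their column names, and the helper `to_resolved` are hypothetical and invented purely for illustration; they are not part of the course data.

```python
import pandas as pd

# Cross-reference rows copied from the table above. IDs are stored as
# strings so that ERP and CRM keys can share one column.
xref = pd.DataFrame(
    [
        ("erp", "1", "c-0"),
        ("crm", "aa341", "c-0"),
        ("erp", "2", "c-1"),
        ("erp", "4", "c-1"),
        ("crm", "a89ci", "c-1"),
    ],
    columns=["source", "original_id", "resolved_id"],
)

# Hypothetical transactional data (made up for this sketch).
orders = pd.DataFrame(
    {"customer_id": ["1", "2", "4"], "order_value": [100.0, 250.0, 50.0]}
)
visits = pd.DataFrame({"account_id": ["aa341", "a89ci", "a89ci"]})


def to_resolved(df: pd.DataFrame, key: str, source: str) -> pd.DataFrame:
    """Attach resolved_id to df by looking up its original key for one source."""
    mapping = xref.loc[xref["source"] == source, ["original_id", "resolved_id"]]
    return df.merge(mapping, left_on=key, right_on="original_id", how="left")


# Aggregate each source per resolved customer entity.
orders_per_entity = (
    to_resolved(orders, "customer_id", "erp")
    .groupby("resolved_id")["order_value"]
    .sum()
)
visits_per_entity = (
    to_resolved(visits, "account_id", "crm")
    .groupby("resolved_id")
    .size()
    .rename("visits")
)

# Side-by-side view of order value and visit count per entity.
summary = pd.concat([orders_per_entity, visits_per_entity], axis=1)
print(summary)
```

Because both sources are mapped to `resolved_id` before aggregating, the two ERP duplicates (IDs 2 and 4) contribute to the same entity, and ERP orders line up with CRM visits even though the original keys never match.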

Target audience

Entity resolution belongs in every data quality toolbox. As individual contributors (data scientists, engineers, developers), we’ll learn to conduct entity resolution end-to-end in Python. Entity resolution is an active research field in computer science and statistics, a challenging skill to master, and a sought-after proficiency in companies facing data quality issues, which means practically everywhere.

We’ll become familiar with the technology stack around entity resolution. What if it makes more sense to buy rather than build many, or all, of the components? This course prepares us before we talk to vendors. Don’t let them sell us stuff we don’t need; we must challenge vendors to get the best value for our money.

Prerequisites

Entity resolution sits at the intersection of computer science and statistics. We’ll use text processing, machine learning, graphs, and more. Learners are in a good position if they have some experience in the following areas:

  • Coding in Python
  • Manipulating DataFrames with pandas
  • Building a binary classification model with scikit-learn
  • Manipulating graphs with NetworkX

Attribution and glossary

This course would be nothing without many open-source packages and datasets. Please check out the “Glossary” in the “Appendix” section of this course for attribution and a collection of useful links.