SQL-First Entity Resolution with Splink

Become familiar with the Splink entity resolution framework and understand its SQL-first approach.

Many companies are heavily invested in SQL-first analytics platforms. Common commercial examples are Snowflake, Databricks, Google BigQuery, and Amazon Athena. These engines are optimized for computationally expensive data transformation jobs authored in SQL. Wouldn’t it be great to utilize the same SQL engine for expensive entity resolution workloads?

Introducing Splink

Splink is another entity resolution framework. Learners following this course might ask how it differs from RecordLinkage. There are two key differences:

  • In Splink, we author jobs in Python only. The framework translates them into SQL and sends that SQL to a warehouse for the heavy lifting.

  • Splink is limited to the Fellegi-Sunter model family, which does not require manual labels for training. This means we need to worry less about modeling tasks such as labeling data and choosing among classification algorithms. It also limits the complexity of what can be learned: the Fellegi-Sunter model is similar to a logistic regression, with no boosted trees and no deep learning.
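To make both points concrete, here is a minimal sketch of the idea, not Splink's actual API or the SQL it generates. It assumes hand-picked m and u probabilities (hypothetical values; in practice such parameters are estimated from the data), builds a SQL query from a Python specification, and runs it on Python's built-in sqlite3 module, which stands in here for a real warehouse. The names PARAMS, match_weight_sql, and build_query are invented for illustration. Each column contributes an additive log-odds term, which is why the model resembles a logistic regression.

```python
import math
import sqlite3

# Hypothetical parameters (illustrative values only, not learned from data).
# m = P(field agrees | records match), u = P(field agrees | records do not match)
PARAMS = {
    "name": {"m": 0.9, "u": 0.01},
    "city": {"m": 0.8, "u": 0.2},
}


def match_weight_sql(column, m, u):
    """Build a SQL CASE expression giving the log2 Bayes factor for one column."""
    agree = math.log2(m / u)
    disagree = math.log2((1 - m) / (1 - u))
    return f"CASE WHEN l.{column} = r.{column} THEN {agree!r} ELSE {disagree!r} END"


def build_query(params):
    """Sum the per-column weights; the additive form mirrors a linear model."""
    total = " + ".join(
        match_weight_sql(col, p["m"], p["u"]) for col, p in params.items()
    )
    return (
        "SELECT l.id AS id_l, r.id AS id_r, "
        f"{total} AS match_weight "
        "FROM records AS l JOIN records AS r ON l.id < r.id "
        "ORDER BY id_l, id_r"
    )


# sqlite3 is only a stand-in for the SQL engine that would do the heavy lifting.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER, name TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        (1, "cafe roma", "boston"),
        (2, "cafe roma", "boston"),
        (3, "thai palace", "boston"),
    ],
)

# The pairwise comparison and scoring run entirely inside the SQL engine.
rows = conn.execute(build_query(PARAMS)).fetchall()
for id_l, id_r, weight in rows:
    print(id_l, id_r, round(weight, 2))
```

The pair that agrees on both fields receives a positive match weight, while pairs that disagree on the name receive a negative one. Python only assembles the query text; the record comparisons themselves happen in SQL, which is the division of labor the first bullet describes.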

Fortunately, Splink supports DuckDB as a backend for the SQL runtime. DuckDB is an embedded analytics-optimized database engine distributed as a Python package. We don’t need to buy into a commercial analytics service or set up complex warehouse infrastructure to see Splink in action. Let’s use the restaurants dataset to demonstrate the typical Splink workflow.
