SQL-First Entity Resolution with Splink
Become familiar with the Splink entity resolution framework and understand its SQL-first approach.
We'll cover the following...
Many companies are heavily invested in SQL-first analytics platforms. Common commercial examples are Snowflake, Databricks, Google BigQuery, and Amazon Athena. These engines are optimized for computationally expensive data transformation jobs authored in SQL. Wouldn’t it be great to utilize the same SQL engine for expensive entity resolution workloads?
Introducing Splink
Splink is another entity resolution framework. Learners following this course might ask how it differs from RecordLinkage. Two key things have been given below:
In Splink, we only author jobs in Python. The framework translates this into SQL and sends it to a warehouse for the heavy lifting.
Splink is limited to the Fellegi-Sunter model family, which does not require manual labels to train the model. This means we need to worry less about modeling, for example, labeling and choosing among classification algorithms. It also means less complexity in what can be learned. The Fellegi-Sunter model is similar to a logistic regression—no boosted trees, no deep learning.
Fortunately, Splink supports DuckDB as a backend for the SQL runtime. DuckDB is an embedded analytics-optimized database engine distributed as a Python package. We don’t need to buy into a commercial analytics service or set up complex warehouse infrastructure to see Splink in action. Let’s use the restaurants
dataset to demonstrate the typical Splink workflow.
import pandas as pdfrom splink.duckdb.linker import DuckDBLinkerfrom splink.duckdb.blocking_rule_library import block_onimport splink.duckdb.comparison_library as clfrom splink.charts import save_offline_chart# 1. Read and parse the input data:restaurants = pd.read_csv('solvers_kitchen/restaurants.csv')restaurants['phone'] = restaurants.phone.str.replace('[^0-9]', '', regex=True)# 2. Author the entity resolution job using the splink API:settings = {'link_type': 'dedupe_only','unique_id_column_name': 'customer_id','comparisons': [cl.jaro_winkler_at_thresholds('customer_name', distance_threshold_or_thresholds=[0.9, 0.8]),cl.jaro_winkler_at_thresholds('street', distance_threshold_or_thresholds=[0.9, 0.8, 0.7]),cl.exact_match('phone')]}linker = DuckDBLinker(restaurants, settings)# 3. Execute heavy lifting on duckdb:# 3.1. Estimate the prior probability:deterministic_rules = ['l.phone = r.phone','l.customer_name = r.customer_name','l.street = r.street']linker.estimate_probability_two_random_records_match(deterministic_rules, recall=0.6)# 3.2. Estimate u parameters with a clever trick applicable to entity resolution:linker.estimate_u_using_random_sampling(max_pairs=3e5)# 3.3. Estimate m parameters with expectation maximization:linker.estimate_parameters_using_expectation_maximisation(blocking_rule='1 = 1')# 3.4. Create clusters using transitive closure for links above a probability threshold_df_predict = linker.predict()clusters = linker.cluster_pairwise_predictions_at_threshold(df_predict, threshold_match_probability=0.9)print(clusters.as_pandas_dataframe().head())# 4. Inspect fitted Fellegi-Sunter model parameters:chart = linker.match_weights_chart()chart.save('output/match_weights_chart.png', scale_factor=5)
Let’s review the steps individually:
We prepare the input data similarly to what we would for any other entity resolution framework.
We configure the job with the Splink Python API, using the Jaro-Winkler scores for names and streets and exact phone matching. The subtle difference from other frameworks is that we must set thresholds for Jaro-Winkler—more on this in the next section.
Splink translates our configuration into SQL and executes the job on DuckDB.
We conclude by inspecting the Fellegi-Sunter model fit visually.
The match weights in the visual give us an idea about the fitted model—for example, according to the model, a Jaro-Winkler score between streets above 0.7 and below 0.8 has a negative weight (red) and is therefore more likely a no-match. The estimated probability between any two records is a combination of its realized weights and the prior probability.
Let’s highlight how the Fellegi-Sunter model works in the next section.
Fellegi-Sunter intuition
This section explains why we had to choose thresholds for the Jaro-Winker scores and where those estimated weights from the visual come from. We don’t cover every mathematical detail but enough to grow our intuition.
Ultimately, the model predicts the probability
The odds of a match are defined as the ratio
between the probability for a match vs. a no-match. Given a set of weights
...