Deep Matching

Understand how deep learning models change entity resolution workflows.

Entity resolution researchers have experimented with deep learning models for over a decade, and these models have outperformed classical approaches in both precision and recall. Adoption was low in the early years because training these models, with their millions of parameters, was computationally expensive. Today the situation has changed dramatically, thanks to several trends.

  • Google’s BERT paper brought a paradigm shift toward transfer learning: start from an open-source pretrained LLM and fine-tune it for a specific task, achieving superior performance at a fraction of the typical training cost (a sketch of this fine-tuning workflow follows this list).

  • PyTorch and the Hugging Face community share frameworks, datasets, pretrained LLMs, and tutorials, all open source and freely available under generous licenses.

  • Modern hardware (GPUs, TPUs, Apple’s M-series chips in MacBooks, etc.) is accessible to practically everybody, whether through cloud services or in personal computers, at much lower cost than it used to be.
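To make the transfer-learning idea concrete, here is a minimal sketch of fine-tuning a pretrained transformer from the Hugging Face Hub as a pair classifier for entity matching. The model name, the toy records, and the record serialization format are illustrative assumptions, not a reference implementation.

```python
# Minimal sketch: fine-tune a pretrained transformer to decide whether two
# records refer to the same entity. Model choice and records are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # any pretrained encoder from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Serialize each record into text and feed the pair to the model.
record_a = "name: ACME Corp. city: Berlin phone: +49 30 1234"
record_b = "name: Acme Corporation city: Berlin phone: 030 1234"
inputs = tokenizer(record_a, record_b, truncation=True, return_tensors="pt")

# One fine-tuning step: label 1 means "match". Only small updates to the
# pretrained weights and the new classification head are needed, which is
# what keeps transfer learning cheap compared to training from scratch.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
outputs = model(**inputs, labels=torch.tensor([1]))
outputs.loss.backward()
optimizer.step()

# At inference time, the argmax over the two logits gives match / non-match.
prediction = outputs.logits.argmax(dim=-1).item()
print(prediction)
```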

Let’s explore what this all means for entity resolution.

Shallow vs. deep learning workflows

When we discuss deep learning, we always mean multilayer neural network architectures. Everything else is called shallow, which does not always mean simple: a decision tree ensemble, for example, is still called shallow even though it can grow arbitrarily complex. In other words, shallow vs. deep is not just a matter of architectural complexity (although deep models do tend to be more complex) but a shift in how we train models.

The image below illustrates a typical shallow learning workflow. We treat feature engineering and model training as two separate steps during experimentation. In each iteration, we usually reengineer the features by hand, drawing on domain expertise and on feedback from the previous iteration, which means a lot of manual work.
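As an illustration of that two-step workflow, here is a minimal sketch, assuming record pairs with hypothetical name, city, and zip fields: similarity features are engineered by hand in one step, and a shallow model (a random forest from scikit-learn) is trained on them in a separate step.

```python
# Minimal sketch of the shallow workflow: hand-crafted similarity features,
# then a shallow classifier. Field names and toy data are assumptions.
from difflib import SequenceMatcher
from sklearn.ensemble import RandomForestClassifier

def engineer_features(left, right):
    """Manual feature engineering: one similarity score per compared field."""
    return [
        SequenceMatcher(None, left["name"], right["name"]).ratio(),
        SequenceMatcher(None, left["city"], right["city"]).ratio(),
        float(left["zip"] == right["zip"]),
    ]

# Toy labeled pairs (1 = same entity, 0 = different entities).
pairs = [
    ({"name": "ACME Corp.", "city": "Berlin", "zip": "10115"},
     {"name": "Acme Corporation", "city": "Berlin", "zip": "10115"}),
    ({"name": "ACME Corp.", "city": "Berlin", "zip": "10115"},
     {"name": "Bolt GmbH", "city": "Munich", "zip": "80331"}),
]
labels = [1, 0]

# Step 1: feature engineering. Step 2: model training. In the shallow workflow,
# improving results means going back to step 1 and redesigning the features.
X = [engineer_features(left, right) for left, right in pairs]
model = RandomForestClassifier(n_estimators=100).fit(X, labels)
print(model.predict(X))
```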
