Evaluate the Match Quality
Review classification errors and learn how to improve a matching model by example.
The restaurants dataset below is open data. See the glossary in this course's appendix for attribution. The dataset's class column resolves the data: records that share a class value belong to the same entity, and all other record pairs do not.
import pandas as pd

classes = pd.read_csv('solvers_kitchen/classes.csv')
print(classes.head())
We transform this classes cross-reference table into a pandas MultiIndex object of matches, the same format we use for the predicted_matches object, which represents our predicted matches:
from itertools import combinations
from typing import Union


def cross_ref_to_index(
    df: pd.DataFrame,
    id_column: str,
    match_key_columns: Union[str, list[str]],
) -> pd.MultiIndex:
    match_lists = (
        df.sort_values(id_column, ascending=False)
        .groupby(match_key_columns)[id_column]
        .apply(list)
    )
    match_lists = match_lists.loc[match_lists.apply(len) > 1]
    match_pairs = []
    for match_list in match_lists:
        match_pairs += list(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)


true_matches = cross_ref_to_index(
    df=classes, id_column='customer_id', match_key_columns='class'
)
print('First three examples:')
print(true_matches[:3])
This way, we can easily compare true_matches with predicted_matches and evaluate the matching quality.
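As a rough illustration of such a comparison, the sketch below applies pandas set operations to two small MultiIndex objects. The pairs are made-up toy values, not taken from the restaurants dataset, and the sketch assumes both indexes store each pair in the same order (larger ID first); the names true_positives, false_positives, and false_negatives are introduced here for illustration only.

```python
import pandas as pd

# Toy pairs for illustration only (hypothetical IDs).
true_matches = pd.MultiIndex.from_tuples([(2, 1), (4, 3), (6, 5)])
predicted_matches = pd.MultiIndex.from_tuples([(2, 1), (4, 3), (8, 7)])

# Pairs that are both predicted and true.
true_positives = true_matches.intersection(predicted_matches)
# Pairs that were predicted but are not true matches.
false_positives = predicted_matches.difference(true_matches)
# True matches that the prediction missed.
false_negatives = true_matches.difference(predicted_matches)

print('True positives: ', list(true_positives))
print('False positives:', list(false_positives))
print('False negatives:', list(false_negatives))
```

If the two indexes order their pairs differently, for example (1, 2) in one and (2, 1) in the other, the pairs would need to be normalized to a common order before these set operations give meaningful results.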
Evaluation metrics
Predicting match vs. no-match is a binary classification problem. Plain accuracy is misleading here because of the heavy class imbalance: in a typical scenario there are far more no-matches than matches, so a model that predicts no-match for every pair still scores near-perfect accuracy. The entity resolution literature therefore prefers precision and recall.
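To make the imbalance concrete, here is a minimal sketch using only the standard library. All numbers are invented for illustration (1,000 records with three true matching pairs), and the helper precision_recall is a hypothetical name, not a function from this course. Precision is the share of predicted pairs that are true matches; recall is the share of true matches that were predicted.

```python
from math import comb


def precision_recall(true_pairs: set, predicted_pairs: set) -> tuple[float, float]:
    """Return (precision, recall) for two sets of record-ID pairs."""
    true_positives = len(true_pairs & predicted_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Toy example (assumed values): 1,000 records, three true matching pairs.
n_records = 1_000
true_pairs = {(2, 1), (4, 3), (6, 5)}
predicted_pairs = {(2, 1), (4, 3), (8, 7)}

# Every unordered pair of records is one classification decision.
n_pairs = comb(n_records, 2)

# A model that predicts "no-match" everywhere is wrong only on the true matches.
accuracy_all_negative = (n_pairs - len(true_pairs)) / n_pairs

precision, recall = precision_recall(true_pairs, predicted_pairs)
print(f'Accuracy of predicting no matches at all: {accuracy_all_negative:.6f}')
print(f'Precision: {precision:.3f}, recall: {recall:.3f}')
```

The all-negative model finds no matches at all yet its accuracy is almost 1, while precision and recall depend only on the matching pairs and are unaffected by the large number of no-matches.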
...