Evaluate the Match Quality

Review classification errors and learn how to improve a matching model by example.

The restaurants dataset below is open data. See the glossary in this course's appendix for attribution. The dataset's class column holds the ground truth: records that share a class value belong to the same entity, and records with different values do not.

import pandas as pd
classes = pd.read_csv('solvers_kitchen/classes.csv')
print(classes.head())

We transform the classes cross-reference table into a pandas MultiIndex of matching record pairs, the same format we use for the predicted_matches object:

from itertools import combinations
from typing import Union

def cross_ref_to_index(df: pd.DataFrame, id_column: str, match_key_columns: Union[str, list[str]]) -> pd.MultiIndex:
    # Collect the IDs of each group, sorted in descending order
    match_lists = df.sort_values(id_column, ascending=False).groupby(match_key_columns)[id_column].apply(list)
    # Keep only groups with at least two records; singletons yield no pairs
    match_lists = match_lists.loc[match_lists.apply(len) > 1]
    # Enumerate every pair of IDs within each group
    match_pairs = []
    for match_list in match_lists:
        match_pairs.extend(combinations(match_list, 2))
    return pd.MultiIndex.from_tuples(match_pairs)

true_matches = cross_ref_to_index(df=classes, id_column='customer_id', match_key_columns='class')
print('First three examples:')
print(true_matches[:3])

This way, we can easily compare true_matches with predicted_matches and evaluate the matching quality.
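
As a minimal sketch of such a comparison, pandas index set operations can split the pairs into true positives, false positives, and false negatives. The two small indexes below (toy_true and toy_predicted) are made-up values for illustration, not taken from the restaurants data. The sketch assumes both indexes store each pair in the same order (larger ID first, as cross_ref_to_index produces); otherwise identical pairs would not be recognized as equal.

```python
import pandas as pd

# Hypothetical toy pairs, not taken from the restaurants data.
toy_true = pd.MultiIndex.from_tuples([(2, 1), (4, 3), (6, 5)])
toy_predicted = pd.MultiIndex.from_tuples([(2, 1), (4, 3), (8, 7)])

# Pairs that are both predicted and true
true_positives = toy_predicted.intersection(toy_true)
# Pairs that are predicted but not true
false_positives = toy_predicted.difference(toy_true)
# Pairs that are true but were not predicted
false_negatives = toy_true.difference(toy_predicted)

print(len(true_positives), len(false_positives), len(false_negatives))
```

Inspecting the records behind false_positives and false_negatives is one way to review classification errors pair by pair.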

Evaluation metrics

Predicting match vs. no-match is a binary classification problem. Those familiar with classification know that plain accuracy is misleading here because of heavy class imbalance: a typical scenario has far more no-matches than matches. The entity resolution literature prefers precision and recall.
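
To make the imbalance concrete, here is a rough sketch with made-up numbers: 1,000 records form 499,500 candidate pairs, of which we assume only 100 are true matches. The helper pair_metrics is hypothetical and not part of this course's code. It uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), and returns 0.0 when a denominator is zero, which is a convention chosen only for this sketch.

```python
from math import comb

def pair_metrics(tp: int, fp: int, fn: int, total_pairs: int) -> dict:
    """Accuracy, precision, and recall from pair counts (hypothetical helper)."""
    tn = total_pairs - tp - fp - fn  # all remaining pairs are true negatives
    return {
        "accuracy": (tp + tn) / total_pairs,
        # Assumption: report 0.0 when a metric is undefined (zero denominator)
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

total_pairs = comb(1000, 2)  # number of unordered pairs among 1,000 records

# A "model" that never predicts a match misses all 100 assumed true matches
never_match = pair_metrics(tp=0, fp=0, fn=100, total_pairs=total_pairs)
# A model that finds 80 of the 100 true matches and adds 20 wrong pairs
decent = pair_metrics(tp=80, fp=20, fn=20, total_pairs=total_pairs)

print(never_match)
print(decent)
```

Under these assumed counts, the model that never predicts a match still scores above 99.9% accuracy while its recall is zero, so accuracy barely separates it from the second model.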

  • ...