Effective Text Preprocessing
Learn how to preprocess string attributes with RecordLinkage in five lines of code.
Our goal is to resolve restaurants. Two records will be called duplicates if customer_name, street, city, and phone combined are similar enough. All these attributes are strings. We will apply a few cheap and effective preprocessing steps, increasing our matching quality by a large margin.
Semantic-preserving string manipulations
What do all transformations below have in common?
customer_name: Hyde Street Bistro > hyde street bistro
street: 70 w. 68th st. > 70 w 68th st > 70 west 68th street
city: L.A. > la > los angeles
phone: 212/362-2200 > 2123622200
They alter the text without altering the information content relevant to our matching task. In short, they preserve semantics in our context. Why manipulate at all if all versions are equivalent in meaning? The answer is that it might not matter for humans, but it does for the algorithms we use to compute similarity.
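To make the point concrete, the transformations listed above can be sketched in plain Python. This is a minimal illustration that uses only the standard library, not the RecordLinkage API. The function names and the two abbreviation tables are assumptions made up for this sketch, and difflib's SequenceMatcher merely stands in for whatever similarity measure is actually used.

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation tables, invented for this sketch only.
STREET_ABBREVIATIONS = {"w": "west", "st": "street"}
CITY_ABBREVIATIONS = {"la": "los angeles"}


def clean_text(value):
    """Lowercase, drop everything except letters, digits and spaces,
    then collapse repeated whitespace."""
    value = value.lower()
    value = re.sub(r"[^a-z0-9 ]", "", value)
    return re.sub(r"\s+", " ", value).strip()


def expand(value, table):
    """Replace each whole token found in the table with its long form."""
    return " ".join(table.get(token, token) for token in value.split())


def clean_phone(value):
    """Keep only the digits of a phone number."""
    return re.sub(r"\D", "", value)


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 (1.0 means identical strings)."""
    return SequenceMatcher(None, a, b).ratio()


name = clean_text("Hyde Street Bistro")
street = expand(clean_text("70 w. 68th st."), STREET_ABBREVIATIONS)
city = expand(clean_text("L.A."), CITY_ABBREVIATIONS)
phone = clean_phone("212/362-2200")

# The raw spellings of the same city look different to the algorithm,
# while the normalized spellings are identical.
raw_score = similarity("L.A.", "Los Angeles")
clean_score = similarity(city, clean_text("Los Angeles"))
```

Here raw_score stays below 1.0 although both strings name the same city, while clean_score is exactly 1.0 because the normalized strings are equal. Real data needs larger, carefully reviewed abbreviation tables, since an overly eager replacement could change the meaning of a value.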
For example, the