Removing and Replacing Tokens

Learn to remove stopwords and normalize word variations, such as synonyms, to improve the matching quality.

A text consists of one or more words and other tokens. Some of those are more informative than others. Words can vary in spelling, grammar, language, and more. Let’s discuss which types of words should be removed and which should be replaced to improve the matching quality.

Remove tokens (stopwords)

Stopwords are tokens that carry little or no information for matching. In an entity resolution task, they can do more harm than good because they make unrelated records look similar. Let’s take three records from the open restaurants dataset as an example. See the Glossary for attribution and references for open data.
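To illustrate stopword removal in code, here is a minimal sketch in Python. The restaurant names and the stopword list are made up for illustration; they are assumptions, not records from the dataset or the list used in this course. The helper name remove_stopwords is also hypothetical.

```python
import re

# Hypothetical stopword list for restaurant names (an assumption for this sketch).
STOPWORDS = frozenset({"the", "restaurant", "cafe", "inc", "llc"})


def remove_stopwords(name, stopwords=STOPWORDS):
    """Lowercase a name, split it into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    kept = [token for token in tokens if token not in stopwords]
    # If every token is a stopword, keep the original tokens so the name is not empty.
    return " ".join(kept or tokens)


# Made-up name variants that refer to the same hypothetical restaurant.
names = [
    "The Golden Dragon Restaurant",
    "Golden Dragon",
    "Golden Dragon Cafe, Inc.",
]

cleaned = [remove_stopwords(name) for name in names]
print(cleaned)  # ['golden dragon', 'golden dragon', 'golden dragon']
```

After removal, the three variants collapse to the same string, so even an exact comparison would match them. The fallback for names made up entirely of stopwords is a design choice of this sketch: it avoids producing empty strings that would all compare as equal.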
