Merging and Splitting Tokens
Let’s see how we can tokenize multiword expressions and multiword entities.
Overview
We extracted named entities in the previous section, but what if we want to merge or split multiword named entities? And what if the tokenizer handled some unusual tokens poorly, and we want to split them by hand? In this lesson, we'll cover a practical remedy for multiword expressions, multiword named entities, and typos.
doc.retokenize is the correct tool for merging and splitting spans. Let's see an example of retokenization by merging a multiword named entity, as follows:
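Here is a minimal sketch of such a merge, assuming spaCy v3 is installed. To stay self-contained, it uses a blank English pipeline (tokenizer only, no trained model), so the span is picked by hard-coded token indices rather than taken from doc.ents. The example sentence and the lemma value are our own illustrative choices.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only) is enough here,
# so no trained model has to be downloaded.
nlp = spacy.blank("en")
doc = nlp("She lived in New Hampshire.")

# Token texts before merging: "New" and "Hampshire" are separate tokens.
before = [token.text for token in doc]
print(before)

# doc.retokenize() is a context manager; the merge is applied to the doc
# when the with block exits.
with doc.retokenize() as retokenizer:
    # doc[3:5] is the span covering "New Hampshire".
    # attrs sets attributes on the merged token (here, an illustrative lemma).
    retokenizer.merge(doc[3:5], attrs={"LEMMA": "new hampshire"})

# Token texts after merging: the span is now a single token.
after = [token.text for token in doc]
print(after)
```

With a trained pipeline, the same merge call could be given a span from doc.ents instead of a hand-picked slice.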