Customizing the Tokenizer and Sentence Segmentation
Let's learn how we can add special case rules to an existing Tokenizer class instance.
When we work with a specific domain, such as medicine, insurance, or finance, we often come across words, abbreviations, and entities that need special attention. Most domains we'll process have characteristic words and phrases that require custom tokenization rules. Here's how to add a special case rule to an existing Tokenizer class instance:
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("en_core_web_md")
doc = nlp("lemme that")
print([w.text for w in doc])

# Split "lemme" into the two tokens "lem" and "me".
special_case = [{ORTH: "lem"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("lemme", special_case)
print([w.text for w in nlp("lemme that")])
Here is what we did:
- We again started by importing spacy.
- Then, we imported the ORTH symbol, which means orthography; that is, text.
- We continued by loading the English model and tokenizing the sentence "lemme that", printing its tokens to show the default behavior.
- Next, we defined a special case that splits the string "lemme" into the two tokens "lem" and "me", and registered it with the tokenizer by calling add_special_case.
- Finally, we tokenized the same sentence again to see the new rule in action.
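A special case rule can also set the NORM attribute on the resulting tokens, which is useful for normalizing the kind of domain-specific abbreviations mentioned earlier. The following sketch follows the same pattern as the example above; the word "gimme" and the chosen norm "give" are just illustrative choices, not part of the original example:

import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_md")

# Split "gimme" into "gim" + "me" and normalize the first
# piece to "give" for downstream components.
special_case = [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}]
nlp.tokenizer.add_special_case("gimme", special_case)

# Each token keeps its surface text; only the norm changes.
print([(w.text, w.norm_) for w in nlp("gimme that")])

Note that the ORTH values of a special case must concatenate back to exactly the original string; NORM changes only the token's normalized form, never its surface text.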