Introducing Tokenization
Let's learn about tokenization.
Tokenization is the first step in a text processing pipeline; it always comes first because every later operation works on the tokens it produces.
Tokenization means splitting a sentence into its tokens. A token is a unit of semantics, the smallest meaningful part of a piece of text. Tokens can be words, numbers, punctuation marks, currency symbols, and any other meaningful symbols that serve as the building blocks of a sentence. The following are examples of tokens:
| Example tokens | |
| --- | --- |
| USA | NY |
| city | 33 |
| 3rd | ! |
| ...? | 's |
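To see why simply splitting on whitespace is not enough, consider a sentence that contains several of the tokens above. The snippet below is a minimal illustration in plain Python, with no NLP library involved; the sentence is made up for this example.

```python
text = "It's my cat's 3rd birthday in NY!"

# Naive whitespace splitting keeps punctuation and contracted
# forms attached to the neighboring words.
print(text.split())
# ["It's", 'my', "cat's", '3rd', 'birthday', 'in', 'NY!']

# A proper tokenizer would instead produce units such as
# "It", "'s", "3rd", "NY", and "!" as separate tokens.
```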
Tokenization in spaCy
The input to the spaCy tokenizer is a Unicode text, and the result is a Doc object. The following code shows the tokenization process:
```python
import spacy

nlp = spacy.load("en_core_web_md")      # load the English model
doc = nlp("I own a ginger cat.")        # tokenize the input text into a Doc
print([token.text for token in doc])    # ['I', 'own', 'a', 'ginger', 'cat', '.']
```
Here is what we just did:

- We start by importing spaCy.
- We load the English model en_core_web_md to create an nlp pipeline object.
- We feed the sentence "I own a ginger cat." to the pipeline, which returns a Doc object.
- We print the text of each token in the Doc with a list comprehension.
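Tokenization is more than splitting on spaces: spaCy also separates punctuation and contracted forms such as 's into their own tokens. As a rough illustration, again assuming en_core_web_md is installed, the sentence from the earlier sketch could be tokenized like this:

```python
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("It's my cat's 3rd birthday in NY!")
print([token.text for token in doc])
# Expected output (may vary slightly by spaCy version):
# ['It', "'s", 'my', 'cat', "'s", '3rd', 'birthday', 'in', 'NY', '!']
```

Notice how the contraction 's, the ordinal 3rd, and the exclamation mark each become tokens of their own, matching the example tokens listed earlier.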