

Introducing Tokenization


Let's learn about tokenization.


Tokenization is the first step in a text processing pipeline. It always comes first because every later operation works on the tokens it produces.

Tokenization means splitting a text into its tokens. A token is a unit of meaning: you can think of it as the smallest meaningful part of a piece of text. Tokens can be words, numbers, punctuation marks, currency symbols, and any other meaningful symbols that serve as the building blocks of a sentence. The following are examples of tokens:

Example tokens
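
To make the idea of a token concrete, here is a minimal sketch in plain Python. It is not how spaCy tokenizes text; `TOKEN_PATTERN` and `simple_tokenize` are hypothetical names made up for this illustration, and the regular expression is a deliberately simplified assumption about what counts as a token. It shows why splitting on whitespace alone is not enough: punctuation and currency symbols stay glued to their neighbors.

```python
import re

# Hypothetical, simplified pattern for illustration only (not spaCy's rules).
TOKEN_PATTERN = re.compile(
    r"\d+(?:\.\d+)?"  # numbers such as 3 or 4.50
    r"|\w+"           # words
    r"|[^\w\s]"       # any single punctuation mark or currency symbol
)


def simple_tokenize(text):
    """Return the list of substrings matched by TOKEN_PATTERN, in order."""
    return TOKEN_PATTERN.findall(text)


sentence = "I paid $4.50 for 3 apples!"

# Splitting on whitespace leaves "$4.50" and "apples!" as single pieces.
whitespace_tokens = sentence.split()
print(whitespace_tokens)

# The regex sketch separates the currency symbol, the number, and the "!".
regex_tokens = simple_tokenize(sentence)
print(regex_tokens)
```

A real tokenizer has to handle many more cases, such as abbreviations and contractions, which is why a library tokenizer is normally used instead of a hand-written pattern like this one.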









Tokenization in spaCy

The input to the spaCy tokenizer is Unicode text, and the output is a Doc object. The following code shows the tokenization process:

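import spacy

# Load the medium English pipeline (it must already be installed).
nlp = spacy.load("en_core_web_md")
# Calling nlp on a string tokenizes it and returns a Doc object.
doc = nlp("I own a ginger cat.")
# Print the text of each token in the Doc.
print([token.text for token in doc])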

Here is what the code above does:

  • We start by importing spaCy. ...