...

Understanding Tokenization and Its Importance

Learn how tokenization converts a document into individual units, like words.

Tokenization

Tokenization is the process of breaking down a document into smaller components, usually individual words, but sometimes sentences or phrases. A simple way to understand tokenization is through an example. Run the following code:

library(tm, quietly=TRUE)
# A sample passage to tokenize
sampleText <- "The raising of ghosts or devils was a
promise liberally accorded by my favourite authors, the fulfilment of
which I most eagerly sought; and if my incantations were always
unsuccessful, I attributed the failure rather to my own inexperience and
mistake, than to a want of skill or fidelity in my instructors."
# Split the passage into individual word tokens
Boost_tokenizer(sampleText)

The outcome of Boost_tokenizer is evident: the text has been segmented into individual words. This representation is commonly known as the “bag of words,” and it makes analyses like word frequency assessment straightforward. However, this approach can discard contextual information. For example, is the word “interest” a relational term or a financial term? In a personal narrative like the passage above, it would refer to the narrator’s curiosity or involvement; in a quarterly report from a public company, “interest” would have an entirely different meaning.
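
To see the bag-of-words idea in action, here is a minimal sketch that counts how often each token occurs. It assumes sampleText from the snippet above is still defined, lower-cases the text so capitalization doesn’t split counts, and uses base R’s table() to tally frequencies:

library(tm, quietly=TRUE)
# Reuses sampleText from the snippet above
tokens <- Boost_tokenizer(tolower(sampleText))
# Tally and sort word frequencies, then show the most common tokens
wordFreq <- sort(table(tokens), decreasing=TRUE)
head(wordFreq)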

Advanced tokenization techniques, such as phrase or sentence recognition, preserve more of this context by keeping related words together as a single unit. This need for different levels of granularity is why a variety of tokenizers exist. The tm package provides a function that lists the tokenizers included with the package.
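
As a rough illustration of phrase-level tokenization, the sketch below defines a small bigram tokenizer of our own (it is not part of tm) that pairs each word from Boost_tokenizer with the word that follows it, so two-word phrases survive as single units. It is applied to the sampleText string from earlier:

library(tm, quietly=TRUE)
# A hypothetical bigram tokenizer: joins each word with its neighbour
bigram_tokenizer <- function(x) {
  words <- Boost_tokenizer(x)
  if (length(words) < 2) return(words)
  paste(words[-length(words)], words[-1])
}
head(bigram_tokenizer(sampleText))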

library(tm, quietly=TRUE)
# List the tokenizers that ship with the tm package
getTokenizers()

These three tokenizers are similar in behavior; a quick comparison sketch follows the list:

The Boost_tokenizer implements the tokenizer from the Boost C++ libraries. It breaks strings up by spaces and punctuation.

The MC_tokenizer implements a tokenizer from the MC Toolkit. Unfortunately, documentation and source code have become elusive.

The scan_tokenizer is an alias ...
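
To compare the three for yourself, one simple approach is to run each tokenizer on the same input and inspect the first few tokens and the total token counts. The sketch below assumes sampleText from the first snippet is still defined:

library(tm, quietly=TRUE)
# Apply each built-in tokenizer to the same text
tokenizers <- list(Boost = Boost_tokenizer,
                   MC    = MC_tokenizer,
                   scan  = scan_tokenizer)
lapply(tokenizers, function(tok) head(tok(sampleText)))   # first tokens from each
sapply(tokenizers, function(tok) length(tok(sampleText))) # total token counts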