Understanding Tokenization and Its Importance
Learn how tokenization converts a document into individual units, like words.
Tokenization
Tokenization is the process of breaking down a document into smaller components, usually individual words, but sometimes sentences or phrases. A simple way to understand tokenization is through an example. Run the following code:
library(tm, quietly=TRUE)
sampleText <- "The raising of ghosts or devils was a promise liberally accorded by my favourite authors, the fulfilment of which I most eagerly sought; and if my incantations were always unsuccessful, I attributed the failure rather to my own inexperience and mistake, than to a want of skill or fidelity in my instructors."
Boost_tokenizer(sampleText)
The outcome of Boost_tokenizer is evident: the text has been segmented into individual words. This representation is commonly known as a “bag of words,” and it makes analyses such as word-frequency counts straightforward. However, it can also discard contextual information. For example, is the word “interest” a relational term or a financial term? In a passage of narrative fiction, it most likely describes a person’s involvement or curiosity; in a quarterly report from a public company, the same word would have an entirely different meaning.
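To see the bag-of-words idea in action, here is a minimal sketch that counts how often each token appears in the sample text. The lower-casing and punctuation-stripping steps are assumptions added for illustration; they are not part of Boost_tokenizer itself.

tokens <- Boost_tokenizer(sampleText)               # character vector of tokens
tokens <- tolower(gsub("[[:punct:]]", "", tokens))  # crude normalisation, added for illustration
tokens <- tokens[tokens != ""]                      # drop tokens that were only punctuation
head(sort(table(tokens), decreasing = TRUE), 10)    # ten most frequent tokens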
Advanced tokenization techniques, such as phrase or sentence recognition, preserve more of this context by keeping related words together in a single unit. This need for different levels of granularity is why several tokenizers exist.
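As a rough illustration of sentence-level tokenization, the sketch below splits a string at sentence-ending punctuation using a base-R regular expression; it is a simple demonstration, not one of tm's built-in tokenizers.

# Crude sentence tokenizer: split after ., ! or ? followed by whitespace
twoSentences <- "Tokens are often words. They can also be whole sentences or phrases."
unlist(strsplit(twoSentences, "(?<=[.!?])\\s+", perl = TRUE))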
The tm package provides a function to list the tokenizers included with the package.
library(tm, quietly=TRUE)
getTokenizers()
These three tokenizers are similar in behavior (a quick comparison follows the descriptions below):
The Boost_tokenizer implements the tokenizer from the Boost C++ library. It breaks strings up by spaces and punctuation.
The MC_tokenizer implements a tokenizer from the MC Toolkit. Unfortunately, its documentation and source code have become elusive.
The scan_tokenizer is an alias ...
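As a rough check of that similarity, the sketch below runs the same sample text through all three tokenizers and compares the leading tokens; it assumes tm is loaded and sampleText from the earlier example is still in the workspace.

# Compare the first few tokens produced by each tokenizer
lapply(list(Boost = Boost_tokenizer,
            MC    = MC_tokenizer,
            scan  = scan_tokenizer),
       function(tok) head(tok(sampleText), 8))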