Understanding Tokenization and Its Importance
Learn how tokenization converts a document into individual units, like words.
Tokenization
Tokenization is the process of breaking down a document into smaller components, usually individual words, but sometimes sentences or phrases. A simple way to understand tokenization is through an example. Run the following code:
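A minimal sketch, assuming the tm package is installed; the sample sentence here is invented for illustration and contains the word “interest”, which the discussion below picks up:

```r
# Load the tm package, which supplies Boost_tokenizer
library(tm)

# An invented sample sentence; any character string works here
text <- "After so long at sea, he had lost all interest in his life on shore."

# Split the document into individual word tokens
Boost_tokenizer(text)
```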
The output of Boost_tokenizer is easy to interpret: the text has been split into individual words. This representation is commonly known as a “bag of words,” and it makes analyses such as word-frequency counts straightforward. However, it can discard contextual information. For example, is the word “interest” a relational term or a financial term? In the sentence above, it is clear the author is talking about engagement with life on shore. If the context were a quarterly report from a public company, “interest” would have an entirely different meaning.
More advanced tokenization techniques, such as phrase or sentence tokenization, preserve more of this context by keeping related words together. This need for different units is why multiple tokenizers exist. The tm package provides the getTokenizers() function to list the tokenizers included with the package.
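Calling the listing function prints the names of the bundled tokenizers:

```r
library(tm)

# List the tokenizers bundled with the tm package
getTokenizers()
# Expected to include the three tokenizers discussed below:
# [1] "Boost_tokenizer" "MC_tokenizer"    "scan_tokenizer"
```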
These three tokenizers are similar in behavior (a side-by-side comparison follows the list):
The Boost_tokenizer implements the tokenizer from the Boost C++ libraries. It splits strings on spaces and punctuation.
The MC_tokenizer implements a tokenizer from the MC Toolkit. Unfortunately, its documentation and source code have become hard to find.
The scan_tokenizer mimics the behavior of R’s scan() function with what = "character", splitting strings on whitespace.
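To compare their behavior directly, all three tokenizers can be applied to the same string. This is a sketch with an invented input chosen to exercise punctuation and spacing:

```r
library(tm)

# An invented test string with punctuation and irregular spacing
text <- "Punctuation, hyphens, and  spacing are handled differently."

# Splits on spaces and punctuation (Boost tokenizer)
Boost_tokenizer(text)

# Tokenizer from the MC Toolkit
MC_tokenizer(text)

# Splits on whitespace, like scan(what = "character")
scan_tokenizer(text)
```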