Understanding Tokenization and Its Importance
Learn how tokenization converts a document into individual units, like words.
Tokenization
Tokenization is the process of breaking down a document into smaller components, usually individual words, but sometimes sentences or phrases. A simple way to understand tokenization is through an example. Run the following code:
library(tm, quietly=TRUE)
sampleText <- "The raising of ghosts or devils was a promise liberally accorded by my favourite authors, the fulfilment of which I most eagerly sought; and if my incantations were always unsuccessful, I attributed the failure rather to my own inexperience and mistake, than to a want of skill or fidelity in my instructors."
Boost_tokenizer(sampleText)
The outcome of Boost_tokenizer is evident: the text has been segmented into individual words. This representation is commonly known as a “bag of words,” and it makes analyses such as word-frequency counts straightforward. However, it can also discard contextual information. For example, is the word “interest” a relational term or a financial term? In a passage of narrative fiction, it most likely describes a person’s involvement or curiosity; in a quarterly report from a public company, the same word would have an entirely different meaning.
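To see the bag-of-words idea in action, here is a minimal sketch that counts how often each token appears in the sample text. The lower-casing and punctuation-stripping steps are assumptions added for illustration; they are not part of Boost_tokenizer itself.

tokens <- Boost_tokenizer(sampleText)               # character vector of tokens
tokens <- tolower(gsub("[[:punct:]]", "", tokens))  # crude normalisation, added for illustration
tokens <- tokens[tokens != ""]                      # drop tokens that were only punctuation
head(sort(table(tokens), decreasing = TRUE), 10)    # ten most frequent tokens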
Advanced tokenization techniques, such as phrase or sentence recognition, preserve more of this context by keeping related words together in a single unit. This need for different levels of granularity is why several tokenizers exist.
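As a rough illustration of sentence-level tokenization, the sketch below splits a string at sentence-ending punctuation using a base-R regular expression; it is a simple demonstration, not one of tm's built-in tokenizers.

# Crude sentence tokenizer: split after ., ! or ? followed by whitespace
twoSentences <- "Tokens are often words. They can also be whole sentences or phrases."
unlist(strsplit(twoSentences, "(?<=[.!?])\\s+", perl = TRUE))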
The tm package provides a function to list the tokenizers included with the package.
library(tm, quietly=TRUE)
getTokenizers()
These three tokenizers are similar in behavior (a quick comparison follows the descriptions below):
The Boost_tokenizer implements the tokenizer from the Boost C++ library. It breaks strings up by spaces and punctuation.
The MC_tokenizer implements a tokenizer from the MC Toolkit. Unfortunately, its documentation and source code have become elusive.
The scan_tokenizer is an alias ...
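As a rough check of that similarity, the sketch below runs the same sample text through all three tokenizers and compares the leading tokens; it assumes tm is loaded and sampleText from the earlier example is still in the workspace.

# Compare the first few tokens produced by each tokenizer
lapply(list(Boost = Boost_tokenizer,
            MC    = MC_tokenizer,
            scan  = scan_tokenizer),
       function(tok) head(tok(sampleText), 8))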