Reducing and Aggregating Terms

Explore the concepts of stemming and lemmatization in this lesson to understand how words can be reduced to their root forms for text analysis. Learn when to apply each technique in R, using packages like textstem and udpipe, to improve your NLP tasks by balancing speed and semantic precision.

Stemming vs. lemmatization

Stemming and lemmatization are two techniques used in natural language processing (NLP) to reduce words to their base or root form. This is done to simplify text processing and analysis by grouping together different forms of the same word. While stemming and lemmatization serve a similar purpose, they differ in their approach.

Understanding stemming

Consider the following words:

  • Walked

  • Walking

  • Walker

  • Walk

These are all derived from “walk.” When calculating word frequencies on text containing these words, we may not want them counted as four distinct terms. Rather, it might make more sense for them to count as four instances of “walk.” Stemming is the process of reducing each word to its stem, typically by stripping suffixes, so that these variants are grouped together.
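To make the effect on word counts concrete, here is a minimal base R sketch. The `naive_stem()` helper is hypothetical and written only for this illustration: it strips a trailing "ing", "ed", or "er" and is not the Porter algorithm or the stemmer of any R package. A real stemmer applies more careful rules and may treat some of these words differently.

```r
# Toy suffix stripper for illustration only (hypothetical helper);
# this is NOT the Porter algorithm or any package's stemmer.
naive_stem <- function(words) {
  words <- tolower(words)
  # Drop a trailing "ing", "ed", or "er", keeping at least three letters.
  sub("^(.{3,})(ing|ed|er)$", "\\1", words)
}

words <- c("Walked", "Walking", "Walker", "Walk")

# Without stemming, each form is counted as its own term.
raw_counts <- table(tolower(words))

# After stemming, the forms collapse into a single term.
stems <- naive_stem(words)
stem_counts <- table(stems)

print(raw_counts)
print(stem_counts)
```

Comparing the two tables shows the point of the technique: the unstemmed table has one row per word form, while the stemmed table has a single row that aggregates them.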

Here’s an example of how stemming works. ...