What is stemming in NLP?

Overview

We reduce word inflection to its root forms with an approach known as stemming. It’s a natural language processing method that helps prepare text, words, and documents for text normalization.

Word inflection is a process through which we alter words to convey a variety of grammatical categories, including tense, case, voice, aspect, person, number, gender, and mood. Therefore, even if a word may have various inflected forms, the NLP process becomes more complicated when multiple inflected forms appear in the exact text.

Consider the words playing, played, and plays. The root word for all these words is play.

Hence, we’ll use stemming for information retrieval, text mining SEOs, Web search results, indexing, tagging systems, word analysis, and more.

Below are some of the different stemming algorithms in Python NLTK:

  • Porter stemmer
  • Snowball stemmer
  • Lancaster stemmer

Porter stemmer

The Porter stemmer is well known for its simplicity and speed. Often, the resulting stem is the shorter term with the same root meaning. It’s designed to remove and replace well-known suffixes of English words.

In NLTK, we use the PorterStemmer() class to implement the Porter stemmer algorithm.

We use the stem() method of the PorterStemmer() class to stem a given word.

Code

from nltk.stem import PorterStemmer
porter = PorterStemmer()
words = ['plays','playing','played','play']
for word in words:
print(word, "->", porter.stem(word))

Snowball stemmer

The Snowball stemmer is precise and faster. It’s also designed to remove and replace well-known suffixes of English words.

In NLTK, we use the SnowballStemmer() class to implement the Snowball stemmer algorithm. It supports 15 non-English languages.

We use the stem() method of the SnowballStemmer() class to stem a given word.

Code

from nltk.stem import SnowballStemmer
snowball = SnowballStemmer(language='english')
words = ['plays','playing','played','play']
for word in words:
print(word, "->", snowball.stem(word))

Lancaster stemmer

Although the Lancaster stemmer is simple, it frequently yields results with excessive stemming. Over-stemming makes stems unintelligible or non-linguistic.

In NLTK, we use the LancasterStemmer() class to implement the Lancaster stemmer algorithm.

We use the stem() method of the LancasterStemmer() class to stem a given word.

Code

from nltk.stem import LancasterStemmer
lancast = LancasterStemmer()
words = ['plays','playing','played','play']
for word in words:
print(word, "->", lancast.stem(word))

Free Resources