What is lemmatization?

Natural language processing

Natural language processing is a specialization of machine learning in which software is trained on language data. Through exposure to a dataset that has already been tagged or labeled, the algorithm tunes its parameters to make the best possible prediction for an unseen data point.

In everyday language, words are altered from their root form to carry extra context. For instance, "drink" is a root word that can take many forms, such as "drank" or "drunk." Each variation signals word-level context, such as tense or number, offered alongside the word.

Example of some forms that can be derived from the word "care": cares, cared, and caring.

Lemmatization

Prior to feeding text to a predictive model for analysis, the words within its sentences are reduced to their core root word, known as the lemma.

For instance, the following is a sentence before lemmatization:

"The students planned a dinner for their instructors."

Following is the same sentence after lemmatization:

"The student plan a dinner for their instructor."

Why lemmatize?

We use lemmatization to strip grammatical variation from the input dataset and thereby simplify it. This means that several different words in the raw dataset can map onto the same word after lemmatization. Consequently, lemmatization reduces noise and simplifies the training process, alongside the obvious increase in the speed of analysis.

This approach works well in settings where the task involves contextual analysis of raw language. Popular applications include sentiment analysis and search, where the goal is to capture the essence of the text and match it against the root word. In other words, a good search algorithm should not push down a result that mentions "cared" just because the search query contains "care."

Lemmatization using the NLTK library in Python

Below is an implementation of lemmatization that uses Python's standard Natural Language Toolkit (NLTK):

import nltk
from nltk.stem import WordNetLemmatizer

# Download the required NLTK data on first use.
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer_instance = WordNetLemmatizer()
input_sentence = "The students planned a dinner for their instructors"
tokenized_form = nltk.word_tokenize(input_sentence)
final_sentence = ""
for word in tokenized_form:
    final_sentence = final_sentence + lemmatizer_instance.lemmatize(word) + " "
print("Sentence before lemmatization is:")
print(input_sentence)
print("\n")
print("Sentence after lemmatization is:")
print(final_sentence)

Explanation

  • We create a WordNetLemmatizer object named lemmatizer_instance.
  • We tokenize input_sentence into a list of words with nltk.word_tokenize.
  • We lemmatize every word in the tokenized_form list and append the results, separated by spaces, to build final_sentence.

Note that lemmatize() treats every word as a noun by default, so a verb such as "planned" is left unchanged here; pass the part of speech explicitly (for example, lemmatize(word, pos="v")) to reduce verbs to their lemmas.

Stemming vs. lemmatization

Stemming and lemmatization are often confused with each other. While both share the objective of stripping grammatical add-ons from core words, their techniques and results differ starkly.

Stemming uses rule-based suffix stripping that bluntly truncates derived words down to a common stem, regardless of whether that stem is the actual root word. As a consequence, the reduced form might not be an actual dictionary word at all.

For instance, the word "studies," when lemmatized, results in the word "study," whereas stemming produces "studi."

Stemming

  • It has lower accuracy.
  • It's faster, since there is no need to check for any underlying word context.
  • The resulting root word may or may not be an actual, meaningful word from the dictionary.

Lemmatization

  • It has higher accuracy.
  • It's slower, since there is a need to check for underlying word context.
  • The root word is an actual, meaningful word from the dictionary.