Natural language processing applies machine learning techniques in which software is trained via data-driven means. Given vast exposure to an already tagged or labeled dataset, the algorithm tunes its parameters to derive the best possible prediction mechanism for an unseen data point.
In everyday use of language, words are altered from their root form to carry contextual add-ons. For instance, "drink" is a root word that takes many forms, such as "drunk" or "drank." Each variant encodes word-level context, such as tense, alongside the root word.
Prior to feeding text to a predictive model for analysis, the words within its sentences are reduced to their root form.
For instance, the following is a sentence before lemmatization:
"The students planned a dinner for their instructors."
Following is the same sentence after lemmatization:
"The student plan a dine for their instructor."
We use lemmatization to strip grammatical inflection from the input data set and thereby simplify it. This implies that multiple different words in the raw data set can map onto the same word after lemmatization. Consequently, lemmatization reduces noise and simplifies training, alongside the obvious increase in the speed of analysis.
This approach works well in settings where the task involves contextual analysis of raw language. Popular applications are sentiment analysis and search, where the goal is to capture the essence of text and match on root words. In other words, a good search algorithm would not push down a result that mentions "cared" in response to a search query containing "care."
Below is an implementation of lemmatization that uses the standard Natural Language Toolkit (NLTK) of Python:

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Download the tokenizer and WordNet data on first use.
nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer_instance = WordNetLemmatizer()

input_sentence = "The students planned a dinner for their instructors"
tokenized_form = nltk.word_tokenize(input_sentence)

# Lemmatize each token and rebuild the sentence.
# Note: lemmatize() treats words as nouns by default, so verbs
# such as "planned" pass through unchanged unless pos="v" is given.
final_sentence = ""
for word in tokenized_form:
    final_sentence = final_sentence + lemmatizer_instance.lemmatize(word) + " "

print("Sentence before lemmatization is:")
print(input_sentence)
print("\n")
print("Sentence after lemmatization is:")
print(final_sentence)
```
The code creates a WordNetLemmatizer instance, stores the raw sentence in input_sentence, and splits it into the individual words held in the tokenized_form array. Each token is then lemmatized and appended to the output string.

Stemming and lemmatization are often confused with each other in the domain. However, it needs to be noted that while both boil down to the same objective of eradicating grammatical add-ons to core words, their results and techniques differ starkly.
Stemming relies on rule-based suffix stripping that bluntly reduces longer derivatives to the longest common substring in a cluster, regardless of whether that substring is the actual root word. Another possible consequence is that the reduced form might not be an actual word from the dictionary.
For instance, the word "studies," when lemmatized, results in the word "study," whereas stemming produces "studi."
| Stemming | Lemmatization |
| --- | --- |
| It has lower accuracy. | It has higher accuracy. |
| It's faster, since there is no need to check for any underlying word context. | It's slower, since there is a need to check for underlying word context. |
| The resulting root word may or may not be an actual, meaningful word from the dictionary. | The root word is an actual, meaningful word from the dictionary. |