Understanding Lemmatization
Let’s learn what lemmatization is and how it works in spaCy.
What is lemmatization?
A lemma is the base form of a token. We can think of a lemma as the form in which the token appears in a dictionary. For instance, the lemma of eating is eat, the lemma of eats is eat, and ate similarly maps to eat. Lemmatization is the process of reducing word forms to their lemmas. The following code is a quick example of how to do lemmatization with spaCy:
import spacy
nlp = spacy.load("en_core_web_md")
doc = nlp("I went for working and worked for 3 years.")
for token in doc:
    print(token.text, token.lemma_)
By now, we should be familiar with what the first three lines of the code do. Recall that we import the spacy library, load an English model using spacy.load, which returns a pipeline, and apply the pipeline to the preceding sentence to get a Doc object. Here, we iterate over the tokens to print each token's text and lemma.
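To make the idea of mapping several word forms to one lemma concrete, here is a minimal sketch in plain Python. It is only a toy lookup table, not how spaCy's lemmatizer is implemented, and the names LEMMA_TABLE and toy_lemmatize are hypothetical ones made up for this illustration.

```python
# Toy lookup-based lemmatizer: an illustration of the concept only,
# not spaCy's implementation. LEMMA_TABLE and toy_lemmatize are
# hypothetical names invented for this sketch.
LEMMA_TABLE = {
    "eating": "eat",
    "eats": "eat",
    "ate": "eat",
    "went": "go",
    "worked": "work",
    "working": "work",
}


def toy_lemmatize(word):
    """Return the lemma for a known word form, else the lowercased word."""
    key = word.lower()
    return LEMMA_TABLE.get(key, key)


forms = ["eating", "eats", "ate", "Went", "years"]
lemmas = [toy_lemmatize(form) for form in forms]

# Print each word form next to the lemma the table assigns to it.
for form, lemma in zip(forms, lemmas):
    print(form, lemma)
```

Words missing from the table, such as years here, come back unchanged. That gap is one reason a hand-written table does not scale and a trained pipeline such as spaCy's is used in practice.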
Don't be anxious if all of this sounds too abstract. Let's see lemmatization in action with a real-world example.