Lemmatization with tidytext
Learn how to perform text lemmatization using the tidytext package in R for improved text analysis.
The tidytext package does not lemmatize text by itself; the workflow here relies on textstem::lemmatize_words for that step. Lemmatization is a text preprocessing technique that reduces each word to its base or dictionary form, known as the lemma. For example, the lemma of "united" is "unite", whereas a stemmer cuts it down to "unit". Combined with the one-token-per-row format that tidytext produces, lemmatization takes only a few lines of R.
tidytext is an R package for text mining and analysis based on the principles of tidy data. It provides functions for tokenizing, manipulating, and tidying text data, which makes the text easier to work with.
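As a small illustration of the tidy, one-token-per-row format, the sketch below tokenizes a single made-up sentence with unnest_tokens. The input data frame and its column names (doc_id, text) are assumptions chosen for the example, not part of the article's data.

```r
library(tidytext)

# Hypothetical one-row input; doc_id and text are assumed column names.
docs <- data.frame(
  doc_id = "example",
  text = "The cats were running",
  stringsAsFactors = FALSE
)

# unnest_tokens() splits the text column into one lowercase word per row
# and stores the result in a new column called word.
tokens <- unnest_tokens(docs, word, text)
print(tokens)
```

Each row of tokens now holds a single word, which is the shape that later steps such as stemming or lemmatization with mutate expect.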
Here’s code to perform lemmatization with tidytext:
library(tidyverse)
library(tidytext)
library(readtext)
library(textstem)
library(SnowballC)
lemma_dictionary <- readtext(file = "data/mws*txt") %>%
  make_lemma_dictionary(engine = 'hunspell')
lemmafied <- readtext("data/mws*txt") %>%
  unnest_tokens(word, text) %>%
  mutate(stem = wordStem(word)) %>%
  mutate(lemm = lemmatize_words(word, dictionary = lemma_dictionary)) %>%
  filter(stem != lemm) %>%
  select(-doc_id)
print(lemmafied[, c("word", "stem", "lemm")], n = 100)
lemmafied[7, c("word", "stem", "lemm")] # united vs unit vs unite
lemmafied[88, c("word", "stem", "lemm")] # disadvantages vs advantage
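To try the stem-versus-lemma comparison without the data/mws*txt files, a minimal sketch on a hand-picked word vector might look like the following. The sample words are hypothetical, and the sketch assumes the default dictionary of lemmatize_words rather than a hunspell-based one, so the lemmas it returns may differ from those produced on the full corpus.

```r
library(SnowballC)
library(textstem)

# Hypothetical sample words, not taken from the article's data files.
words <- c("united", "running", "studies")

# Put the stem and the lemma of each word side by side.
comparison <- data.frame(
  word = words,
  stem = wordStem(words),         # Porter stemmer from SnowballC
  lemm = lemmatize_words(words),  # assumes textstem's default lemma dictionary
  stringsAsFactors = FALSE
)
print(comparison)
```

Rows where stem and lemm disagree are the interesting ones: stemming strips suffixes by rule, while lemmatization looks each word up in a dictionary.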
Explaining the lemmatization code
The code above demonstrates how to perform lemmatization with tidytext:
Lines 1–5: The library function is used to load the required libraries (tidyverse, ...