Calculate tf-idf with quanteda
Learn to calculate tf-idf with quanteda to identify important words in documents for improved text analysis.
tf-idf with quanteda
The quanteda package calculates the tf-idf of a document-feature matrix using the dfm_tfidf() function. Term frequency-inverse document frequency (tf-idf) is a weighting used to identify the words that matter most to individual documents in a collection: a term's weight grows with how often it appears in a document and shrinks as it appears in more of the documents.
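To make the weighting concrete, the sketch below compares dfm_tfidf() against a manual calculation on a tiny corpus. The three toy texts and the names txts, toy_dfm, toy_tfidf, and manual are made up for illustration. It assumes the function's default settings (raw counts for term frequency and a base-10 logarithm of the number of documents divided by the document frequency); other schemes would give different numbers.

```r
library(quanteda)

# Hypothetical toy corpus: three very short "documents"
txts <- c(d1 = "apple banana apple",
          d2 = "banana cherry",
          d3 = "banana apple")

# Build a document-feature matrix and weight it with the default tf-idf scheme
toy_dfm   <- dfm(tokens(txts))
toy_tfidf <- dfm_tfidf(toy_dfm)

# Manual calculation under the assumed defaults:
# tf-idf = term count * log10(number of documents / document frequency)
counts   <- as.matrix(toy_dfm)
n_docs   <- nrow(counts)
doc_freq <- colSums(counts > 0)
manual   <- sweep(counts, 2, log10(n_docs / doc_freq), `*`)

# Compare the two results, ignoring dimension names
all.equal(unname(as.matrix(toy_tfidf)), unname(manual))
```

Because "banana" occurs in every toy document, its inverse document frequency is log10(3/3) = 0, so it receives a weight of zero everywhere. This is how tf-idf plays down words that are common across the whole collection.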
Here’s code to demonstrate the creation of tf-idf:
# install.packages("quanteda")
# install.packages("readtext")
library(quanteda, quietly = TRUE)
library(readtext)

# Read the matching text files, then clean, tokenize, and weight them
tf_idf <- readtext(file = "data/mws*txt", docvarsfrom = "filenames") |>
  corpus() |>
  tokens(remove_numbers = TRUE, remove_punct = TRUE) |>
  tokens_remove(pattern = stopwords("english")) |>
  tokens_tolower() |>
  dfm() |>
  dfm_tfidf()

# Show the first document's features, sorted from highest to lowest tf-idf
tf_idf[1, order(tf_idf[1, ], decreasing = TRUE)]
Here’s an ...