Calculate tf-idf with tidytext

Learn how to calculate tf-idf using tidytext in R.

Use tidytext to create tf-idf

tidytext calculates tf-idf with bind_tf_idf(), which takes a table with one row per term per document and adds tf, idf, and tf_idf columns.

Here’s a code sample illustrating the use of bind_tf_idf() with tidytext and the tidyverse:

library(tidyverse)
library(tidytext)
library(readtext)

tfidf_mws <- readtext(file = "data/mws*txt") %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = doc_id, n = n) %>%
  arrange(desc(tf_idf))
tfidf_mws

  • Lines 1–3: These lines load the tidyverse, tidytext, and readtext packages.

  • Line 5: tfidf_mws <- readtext(file = "data/mws*txt") %>% reads the text files that match the pattern and assigns the final result of the pipeline to tfidf_mws. This code does not produce useful results if only one document is read. The formula for idf is:

    idf(term) = ln(number of documents / number of documents containing the term)

  • If the corpus contains only one document, then the number of documents containing any word is also one. The natural log of 1/1 is zero, so every idf value, and therefore every tf-idf value, is zero.

  • Line 6: unnest_tokens(word, text) tokenizes the text from ...
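
To see the single-document case from line 5 concretely, here's a minimal sketch on a toy corpus. The two short texts and the names toy, counts, two_docs, and one_doc are made up for illustration and aren't part of the lesson's data:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy corpus: two tiny "documents" standing in for the text files
toy <- tibble(
  doc_id = c("a", "b"),
  text   = c("the raven spoke", "the cat slept")
)

# One row per document-word pair, with its count in column n
counts <- toy %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word)

# Two documents: "the" appears in both, so its idf is ln(2/2) = 0;
# every other word appears in one document, so its idf is ln(2/1)
two_docs <- counts %>%
  bind_tf_idf(term = word, document = doc_id, n = n)

# One document: every idf is ln(1/1) = 0, so every tf_idf is 0
one_doc <- counts %>%
  filter(doc_id == "a") %>%
  bind_tf_idf(term = word, document = doc_id, n = n)

two_docs
one_doc
```

Comparing the two results should show why at least two documents are needed: in one_doc, the tf column still varies, but the idf and tf_idf columns are zero for every word.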
