Calculate tf-idf with tidytext
Learn how to calculate tf-idf using tidytext in R.
Use tidytext to create tf-idf
tf-idf can easily be calculated with tidytext using bind_tf_idf(). 
Here’s a code sample illustrating the use of bind_tf_idf() with tidytext and the tidyverse:
library(tidyverse)
library(tidytext)
library(readtext)

tfidf_mws <- readtext(file = "data/mws*txt") %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(
    term = word,
    document = doc_id,
    n = n
  ) %>%
  arrange(desc(tf_idf))

tfidf_mws
- Lines 1–3: These lines load the tidyverse, tidytext, and readtext packages.
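- Line 5: readtext(file = "data/mws*txt") reads every text file that matches the pattern, and tfidf_mws <- stores the final result of the pipeline in the tfidf_mws object. The pipeline doesn't produce useful results if only one document is read. The formula for idf is:

  $$\mathrm{idf}(\text{word}) = \ln\left(\frac{\text{total number of documents in the corpus}}{\text{number of documents containing the word}}\right)$$

  If the total number of documents in the corpus is equal to one, then the number of documents containing the word is also one. The natural log of 1/1 is zero, so the idf of every word is zero. Because tf-idf is tf multiplied by idf, every tf-idf score is then zero as well, and the ranking carries no information.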
- Line 6: unnest_tokens(word, text) tokenizes the text from line 5 by splitting it into individual words. Each word is then stored in the column named word.
- Line 7: count(doc_id, word, sort = TRUE) counts the frequency of each word in each document. The doc_id column represents the document identifier, and the ...
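
To make the single-document caveat concrete, here is a minimal sketch that runs the same steps on a tiny in-memory corpus instead of files on disk. The docs tibble, its file names, and its two sentences are made up for illustration; they are not part of the lesson's data.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical two-document corpus standing in for the output of readtext()
docs <- tibble(
  doc_id = c("a.txt", "b.txt"),
  text   = c("the raven spoke", "the monster fled")
)

# Same steps as the lesson's pipeline
tfidf_two <- docs %>%
  unnest_tokens(word, text) %>%         # one word per row
  count(doc_id, word, sort = TRUE) %>%  # word frequency per document
  bind_tf_idf(term = word, document = doc_id, n = n) %>%
  arrange(desc(tf_idf))

tfidf_two

# Keep only the first document and repeat the calculation
tfidf_one <- docs %>%
  slice(1) %>%
  unnest_tokens(word, text) %>%
  count(doc_id, word, sort = TRUE) %>%
  bind_tf_idf(term = word, document = doc_id, n = n)

tfidf_one
```

In the two-document result, "the" appears in both documents, so its idf is ln(2/2) = 0, while words found in only one document get an idf of ln(2/1). In the one-document result, every idf and tf_idf value is zero, which is why at least two documents are needed for the scores to be informative.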