Implement tf-idf with tm
Explore tf-idf in text analysis with R's tm package.
We previously discussed tf-idf, or term frequency-inverse document frequency. It measures how important a word (or token) is to one document relative to the rest of the corpus. With it, we can select a token and infer which document it most likely came from.
Let’s learn how to calculate tf-idf, and then how to use it.
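Before turning to a package, it may help to see the arithmetic on a toy example. The sketch below uses base R only and assumes one common variant of the formula: term frequency normalized by document length, multiplied by log2(number of documents / number of documents containing the term). The three documents and every variable name are hypothetical, and other tf-idf variants will give different numbers.

```r
# Toy corpus: three hypothetical documents, already lowercased and tokenized.
docs <- list(
  doc1 = c("the", "garden", "gate", "garden"),
  doc2 = c("the", "house", "door"),
  doc3 = c("the", "garden", "house")
)

vocab <- sort(unique(unlist(docs)))

# Raw term counts: one row per document, one column per term.
counts <- t(sapply(docs, function(tokens) {
  tabulate(match(tokens, vocab), nbins = length(vocab))
}))
colnames(counts) <- vocab

# Term frequency: each count divided by the length of its document.
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(N / number of documents containing the term).
docFreq <- colSums(counts > 0)
idf <- log2(nrow(counts) / docFreq)

# tf-idf: scale every column (term) by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))

# The document in which "garden" carries the most weight.
names(which.max(tfidf[, "garden"]))
```

A term that appears in every document, such as "the" here, gets an idf of log2(1) = 0, so its tf-idf is zero everywhere. That is why the measure favors terms that set one document apart from the others.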
Calculating tf-idf with tm
Different packages calculate tf-idf in different ways. In the tm package, the weighting is applied when the DTM (document-term matrix) is created.
Here is code to illustrate the creation of tf-idf:
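library(tm, quietly = TRUE)

# compareText is assumed to be a data frame defined earlier,
# with the doc_id and text columns that DataframeSource expects
newCorpus <- VCorpus(DataframeSource(compareText))

# Build the DTM; weighting = weightTfIdf stores tf-idf values instead of raw counts
DTmatrix <- DocumentTermMatrix(newCorpus,
                               control = list(tolower = TRUE,
                                              #stopwords = TRUE,
                                              stripWhiteSpace = TRUE,
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              weighting = weightTfIdf,
                                              #dictionary = c("garden"),
                                              tokenize = "Boost"))

# Print a summary and a sample of the weighted matrix
inspect(DTmatrix)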
When we run this code, we’ll see the resulting DTM for four documents. The first terms in the DTM are “adrian,” “elizabeth,” and “garden.” The matrix values are clearly not term frequencies (which would be integers); instead, they show the tf-idf calculated for each term in ...
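To connect the weighted DTM back to the goal of matching a token to its most likely document, here is a minimal sketch. It does not use the lesson's compareText data. Instead it builds a small hypothetical data frame, toyText, so that it runs on its own, and the object names toyCorpus, toyDTM, and gardenScores are likewise made up for illustration.

```r
library(tm, quietly = TRUE)

# Hypothetical stand-in for compareText; DataframeSource expects
# columns named doc_id and text.
toyText <- data.frame(
  doc_id = c("a", "b", "c"),
  text = c("the garden gate in the garden",
           "the house door",
           "the garden behind the house"),
  stringsAsFactors = FALSE
)

toyCorpus <- VCorpus(DataframeSource(toyText))

# Weight the matrix by tf-idf while it is being built.
toyDTM <- DocumentTermMatrix(toyCorpus,
                             control = list(weighting = weightTfIdf))

# Convert to an ordinary matrix: rows are documents, columns are terms.
weights <- as.matrix(toyDTM)

# tf-idf of the token "garden" in each document, highest first.
gardenScores <- weights[, "garden"]
print(sort(gardenScores, decreasing = TRUE))

# The document where "garden" has the largest tf-idf.
bestDoc <- names(which.max(gardenScores))
print(bestDoc)
```

The same pattern should carry over to the lesson's DTmatrix by replacing toyDTM with DTmatrix, provided the chosen token is one of the matrix's column names.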