Implement tf-idf with tm
Explore tf-idf in text analysis with R's tm package.
We previously discussed tf-idf, or term frequency-inverse document frequency, a measure of how important a word (or token) is to one document relative to the rest of the corpus. With it, we can select a token and infer which document it most likely came from.
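To make the idea concrete, here is a minimal base-R sketch that computes tf-idf by hand. The three documents and all variable names are hypothetical, and the formula used (term count divided by document length, multiplied by log2 of the number of documents over the number of documents containing the term) is one common definition, assumed here for illustration.

```r
# Three toy documents (hypothetical data, not from the lesson).
docs <- c(
  d1 = "the garden is green",
  d2 = "the garden has a gate",
  d3 = "the city is loud"
)

# Tokenize on whitespace; each document becomes a character vector of tokens.
tokens <- strsplit(tolower(docs), "\\s+")
names(tokens) <- names(docs)

# Vocabulary: every distinct token across the corpus.
vocab <- sort(unique(unlist(tokens)))

# Raw counts: one row per document, one column per term.
counts <- t(sapply(tokens, function(tok) table(factor(tok, levels = vocab))))

# Term frequency: counts divided by the length of each document.
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of docs / docs containing the term).
idf <- log2(nrow(counts) / colSums(counts > 0))

# tf-idf: scale each column of tf by that term's idf.
tfidf <- sweep(tf, 2, idf, `*`)

print(round(tfidf, 3))

# Document with the highest weight for the token "gate".
bestDoc <- names(which.max(tfidf[, "gate"]))
print(bestDoc)
```

A token that appears in every document, such as "the", gets a weight of zero, while a token confined to one document, such as "gate", points clearly to that document.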
Let’s learn how to compute tf-idf, and then how to use it.
Calculating tf-idf with tm
Packages differ in how they calculate tf-idf. The tm package applies it as a term weighting when the document-term matrix (DTM) is created.
The following code creates a tf-idf-weighted DTM:
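library(tm, quietly = TRUE)

# DataframeSource expects a data frame with doc_id and text columns
newCorpus <- VCorpus(DataframeSource(compareText))

# Build the DTM, weighting each term by tf-idf instead of raw counts
DTmatrix <- DocumentTermMatrix(newCorpus,
                               control = list(tolower = TRUE,
                                              #stopwords = TRUE,
                                              stripWhiteSpace = TRUE,
                                              removePunctuation = TRUE,
                                              removeNumbers = TRUE,
                                              weighting = weightTfIdf,
                                              #dictionary = c("garden"),
                                              tokenize = "Boost"))

# Print a summary and a sample of the weighted matrix
inspect(DTmatrix)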
When we run this ...
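Because compareText is not defined in this section, the following self-contained sketch uses a hypothetical three-row data frame (toyText) in its place. It assumes the tm package is installed. It shows one way to use the weighted DTM to find the document a token most likely came from. The variable names and sample sentences are illustrative only.

```r
library(tm, quietly = TRUE)

# Hypothetical stand-in for compareText: DataframeSource expects
# a data frame whose first two columns are doc_id and text.
toyText <- data.frame(
  doc_id = c("doc1", "doc2", "doc3"),
  text = c("The garden is green.",
           "The garden has a gate.",
           "The city is loud."),
  stringsAsFactors = FALSE
)

toyCorpus <- VCorpus(DataframeSource(toyText))

# Build a DTM weighted by tf-idf.
toyDTM <- DocumentTermMatrix(toyCorpus,
                             control = list(tolower = TRUE,
                                            removePunctuation = TRUE,
                                            weighting = weightTfIdf))

# Convert the sparse DTM to a dense matrix (fine for a small corpus).
weights <- as.matrix(toyDTM)

# Document in which "gate" has the highest tf-idf weight.
bestDoc <- rownames(weights)[which.max(weights[, "gate"])]
print(bestDoc)
```

Converting to a dense matrix keeps the lookup simple, but for a large corpus it may be preferable to work with the sparse DTM directly.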