Implement tf-idf with tm

Explore tf-idf in text analysis with R's tm package.

We'll cover the following...

Calculating tf-idf with tm
Other weighting methods
Summary

We previously discussed tf-idf, or term frequency-inverse document frequency. It is a weighting that scores how important a word (or token) is to one document relative to the rest of the corpus. Given a token, we can compare its scores across documents to infer which document it most likely came from.
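
To make the idea concrete, here is one common formulation. As far as the tm documentation describes it, this is also what `weightTfIdf` computes with its default normalization. The symbols are notation introduced here for illustration, not taken from the lesson:

$$
\text{tf-idf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}} \cdot \log_2 \frac{N}{\lvert \{ d' : n_{t,d'} > 0 \} \rvert}
$$

Here $n_{t,d}$ is the number of times token $t$ occurs in document $d$, and $N$ is the number of documents in the corpus. A token that appears in every document scores 0, because $\log_2 1 = 0$. A token concentrated in a single document scores highest there.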

Let’s look at how to compute tf-idf, and then at how to use it.

Calculating tf-idf with `tm`

Different packages calculate tf-idf in different ways. In the tm package, the weighting is applied while creating a document-term matrix (DTM), by passing `weighting = weightTfIdf` in the `control` list.

The following code creates a DTM weighted by tf-idf:


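library(tm, quietly = TRUE)

# compareText is assumed to be the data frame built earlier in the course.
# DataframeSource() expects its first two columns to be named doc_id and text.
newCorpus <- VCorpus(DataframeSource(compareText))

# Build the DTM and apply tf-idf weighting in the same step.
# The original stripWhiteSpace entry is dropped: it is not among the control
# options documented for tm's termFreq(), and the tokenizer splits on whitespace.
DTmatrix <- DocumentTermMatrix(
  newCorpus,
  control = list(
    tolower = TRUE,
    # stopwords = TRUE,
    removePunctuation = TRUE,
    removeNumbers = TRUE,
    weighting = weightTfIdf,    # replace raw counts with tf-idf scores
    # dictionary = c("garden"),
    tokenize = "Boost"          # use tm's Boost_tokenizer
  )
)
inspect(DTmatrix)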
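
Once the weighted DTM exists, one way to use it is to look up a token and see which document gives it the highest score. The sketch below is self-contained and assumes a small made-up corpus: `toyText`, its three documents, and the helper `bestDoc()` are hypothetical names introduced for illustration and are not part of the lesson's data.

```r
library(tm, quietly = TRUE)

# Hypothetical three-document corpus, for illustration only.
# DataframeSource() requires the columns doc_id and text.
toyText <- data.frame(
  doc_id = c("gardening", "cooking", "travel"),
  text = c("the garden needs water and the garden needs sun",
           "boil the water and add salt to the pasta",
           "the train leaves the station at noon"),
  stringsAsFactors = FALSE
)

toyCorpus <- VCorpus(DataframeSource(toyText))
toyDTM <- DocumentTermMatrix(toyCorpus,
                             control = list(weighting = weightTfIdf))

# Convert to an ordinary matrix: rows are documents, columns are terms.
scores <- as.matrix(toyDTM)

# Hypothetical helper: the document in which a token has its highest score.
bestDoc <- function(token) {
  rownames(scores)[which.max(scores[, token])]
}

bestDoc("garden")
scores[, "the"]
```

In this toy corpus, "garden" occurs only in the first document, so that is where its score peaks. "the" occurs in all three documents, so its inverse document frequency is log2(3/3) = 0 and its score is zero everywhere. A token shared by several documents, such as "water", is a weaker signal than one that is unique to a single document.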
