Removing Other Unnecessary Terms
Learn to use the data cleaning tools available with the tm package to remove unnecessary terms.
Handling punctuation and numbers
You may have noticed several instances of newlines (\n) in the text. In most cases, punctuation, numbers, and extra white space are unnecessary for NLP analysis. In fact, these elements inflate the word count but don’t add meaning. In this lesson, we’ll talk about removing them as well.
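As a rough illustration of this cleanup, the sketch below applies three of the tm package's built-in transformations (removePunctuation, removeNumbers, and stripWhitespace) to a small corpus. The sample strings and the variable names raw_text, corpus, and cleaned are made up for this example and are not part of the lesson's dataset.

```r
library(tm)

# Hypothetical sample text containing newlines, digits, punctuation,
# and runs of extra spaces.
raw_text <- c("Hello,\n\nworld!  It is 2023.",
              "Call 555-1234   now...")

# Build a volatile corpus with one document per string.
corpus <- VCorpus(VectorSource(raw_text))

# Apply the transformations one at a time. stripWhitespace runs last so
# it can collapse the gaps (and newlines) left behind by the other two.
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Pull the cleaned text back out as a plain character vector,
# trimming any leading or trailing space that remains.
cleaned <- vapply(
  corpus,
  function(doc) trimws(paste(content(doc), collapse = " ")),
  character(1),
  USE.NAMES = FALSE
)

print(cleaned)
```

The order of the steps is a design choice: removing punctuation and numbers can leave extra spaces behind, so collapsing white space at the end avoids a second pass.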
Overview of transformations in the tm package
In NLP, stopwords are removed to provide better visibility to significant words. However, stopwords aren’t the only problem when cleaning text data. Text often includes numbers, punctuation, white space, and capitalized versions of words. Therefore, it’s ...