Removing Other Unnecessary Terms
Learn to use the data cleaning tools available with the tm package to remove unnecessary terms.
Handling punctuation and numbers
You may have noticed several instances of newlines (\n) in the text. In most cases, punctuation, numbers, and extra white space are unnecessary for NLP analysis. In fact, these elements inflate the word count but don’t add meaning. In this lesson, we’ll talk about removing them as well.
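As a rough illustration, the sketch below applies three of tm's cleaning functions directly to a character string. The sample sentence is invented for this example and isn't taken from the lesson's dataset.

```r
library(tm)

# Hypothetical sample text containing a number, newlines,
# punctuation, and repeated spaces
raw <- "Chapter 1\n\nIt was the best of times,   it was the worst of times!"

# Each function also accepts a plain character vector
cleaned <- removeNumbers(raw)          # drop digits
cleaned <- removePunctuation(cleaned)  # drop punctuation characters
cleaned <- stripWhitespace(cleaned)    # collapse runs of white space (including \n) into one space

cleaned
# [1] "Chapter It was the best of times it was the worst of times"
```

Note that stripWhitespace collapses white space rather than deleting it, so words remain separated by single spaces.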
Overview of transformations in the tm package
In NLP, stopwords are removed to provide better visibility to significant words. However, stopwords aren’t the only problem when cleaning text data. Text often includes numbers, punctuation, extra white space, and inconsistent capitalization. Removing or normalizing these elements is therefore an important step toward accurate and effective text processing.
In tm vocabulary, unnecessary terms are removed with transformations. A transformation is applied to every document in a corpus. Typical operations include removing nontext characters, citations, numbers, and punctuation, as well as converting documents to plaintext or converting all text to lowercase.
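A minimal sketch of applying transformations across a corpus with tm_map might look like the following. The two documents are made up for illustration, and the order of the steps is one reasonable choice rather than a required sequence.

```r
library(tm)

# Hypothetical two-document corpus, invented for this example
docs <- c("Chapter 1:\n\nThe Whale!", "In 1851,   the ship SAILED.")
corpus <- VCorpus(VectorSource(docs))

# tm_map applies a transformation to every document in the corpus.
# Base R functions such as tolower are wrapped in content_transformer
# so that each result remains a tm document.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Extract the cleaned text of each document as a character vector
cleaned <- unname(unlist(lapply(corpus, as.character)))
cleaned
# [1] "chapter the whale"  "in the ship sailed"
```

Running stripWhitespace last cleans up the gaps left behind once numbers and punctuation are removed.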
Transformations included with tm can be listed with getTransformations:
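The call takes no arguments and returns a character vector of function names. The output shown in the comment reflects recent versions of tm and may differ in other releases.

```r
library(tm)

# List the names of the predefined transformations shipped with tm
transformations <- getTransformations()
transformations
# [1] "removeNumbers"     "removePunctuation" "removeWords"
# [4] "stemDocument"      "stripWhitespace"
```

Lowercasing doesn't appear in this list because it isn't a built-in tm transformation. It is usually done by wrapping base R's tolower in content_transformer.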