Removing Unnecessary Terms

Learn to remove elements that often add little to an analysis (numbers, punctuation, and stopwords) to simplify and improve text analysis.

Cleaning text

Removing numbers, punctuation, and stopwords is a common preprocessing step in natural language processing (NLP) and text analytics.
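As a minimal sketch of this step, assuming Python with only the standard library: the function name `clean_text` and the very short stopword list below are illustrative choices, not part of this lesson, and a real project would likely use a much longer list.

```python
import re
import string

# Illustrative stopword list (an assumption); real lists are far longer.
STOPWORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "it"}

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def clean_text(text):
    """Lowercase the text, then drop numbers, punctuation, and stopwords."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)    # remove runs of digits
    text = text.translate(PUNCT_TABLE)  # replace punctuation with spaces
    tokens = text.split()               # split on whitespace
    return [t for t in tokens if t not in STOPWORDS]

print(clean_text("The 3 cats sat in the garden, and it rained!"))
# ['cats', 'sat', 'garden', 'rained']
```

Only the content-bearing words survive, which is the effect this kind of cleaning aims for.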

Note: The generalizations in this lesson don’t apply universally. Numbers, punctuation, and stopwords are often treated as less significant in text analysis, but their importance depends on the application and context. For instance, large language models like GPT rely on these elements to capture the meaning and context of the text. It’s essential to evaluate their significance against the specific requirements of the analysis.

In this lesson, let’s look at the results of removing these elements. The first step is to break a collection of documents (a corpus) into words (tokens) and then build a matrix that records how often each word appears in each document (a document-term matrix, or DTM).
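The tokenize-then-count idea can be sketched as follows. This is a rough illustration in plain Python, not the lesson's own code: the two-document corpus and the helper names `tokenize` and `build_dtm` are assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical two-document corpus, used only for illustration.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat, twice!",
]

def tokenize(doc):
    """Split a document into lowercase tokens of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", doc.lower())

def build_dtm(docs):
    """Return (vocab, matrix), where matrix[i][j] counts term vocab[j] in docs[i]."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))  # every distinct term, alphabetically
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

vocab, dtm = build_dtm(corpus)
print(vocab)  # ['cat', 'chased', 'dog', 'mat', 'on', 'sat', 'the', 'twice']
for row in dtm:
    print(row)
# [1, 0, 0, 1, 1, 1, 2, 0]
# [1, 1, 1, 0, 0, 0, 2, 1]
```

In this toy matrix, "the" has the highest count in both rows even though it says little about either document, which is the kind of term that stopword removal targets.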
