When working with text data in NLP, we usually have to preprocess our data before carrying out the main task.
One common preprocessing step we take is removing stop words.
Let’s get to it.
Stop words are words in any language or corpus that occur frequently. For some NLP tasks, they do not provide any additional or valuable information to the text containing them. Words like a, they, the, is, an, etc. are usually considered stop words.
Let’s take the title of this article as an example:
How to remove stop words with NLTK library in Python
Words like how, to, with, and in do not clearly state the topic of the article. However, keywords like remove, stop words, NLTK, library, and Python give a much clearer idea of what to expect from this article.
Interestingly, some of these keywords are part of the tags for this article :)
While there is no universal list of stop words in NLP, many NLP libraries in Python provide their own. We can also create our own list of stop words.
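As a rough illustration of a hand-written list, a Python set is enough. The words chosen here and the name my_stopwords are made up for this sketch and are not a recommended list; splitting on whitespace is only a simple stand-in for a real tokenizer.

```python
# A small, hand-picked stop word list (illustrative only, not exhaustive)
my_stopwords = {"a", "an", "the", "is", "they", "how", "to", "with", "in"}

text = "How to remove stop words with NLTK library in Python"

# Lowercase the text and split on whitespace (a simple stand-in for a real tokenizer)
tokens = text.lower().split()

# Keep only the words that are not in our own stop word list
filtered = [word for word in tokens if word not in my_stopwords]
print(" ".join(filtered))
```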
Here we will be using the list of stop words provided by the NLTK library, so we don’t have to write our own.
However, before we can use the stop words from the NLTK library, we need to download them.
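import nltk

nltk.download('stopwords')  # the stop word lists
nltk.download('punkt')  # tokenizer models needed by word_tokenize; newer NLTK releases may ask for 'punkt_tab' instead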
Make sure you have downloaded the stop words before running the code below. Otherwise, you will get a LookupError.
Next, we convert our text to lowercase and split it into a list of its words. Afterwards, we create a new list containing words that are not in the list of stop words.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Add text
text = "How to remove stop words with NLTK library in Python"
print("Text:", text)

# Convert text to lowercase and split to a list of words
tokens = word_tokenize(text.lower())
print("Tokens:", tokens)

# Remove stop words
english_stopwords = stopwords.words('english')
tokens_wo_stopwords = [word for word in tokens if word not in english_stopwords]
print("Text without stop words:", " ".join(tokens_wo_stopwords))
The output will look like this:
Text: How to remove stop words with NLTK library in Python
Tokens: ['how', 'to', 'remove', 'stop', 'words', 'with', 'nltk', 'library', 'in', 'python']
Text without stop words: remove stop words nltk library python
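If you need to filter many texts, it can help to wrap the filtering step in a small function and turn the stop words into a set, since checking membership in a set is typically faster than scanning a list. The sketch below is one possible way to do that. The function name remove_stop_words and the short stop word list in the usage example are assumptions for illustration; in practice you would pass stopwords.words('english') instead.

```python
def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not stop words, preserving their order."""
    # Convert once to a set so each membership check is fast
    stop_word_set = set(stop_words)
    # Compare in lowercase so capitalized tokens are matched too
    return [token for token in tokens if token.lower() not in stop_word_set]


# Hypothetical mini stop word list, used here so the example runs without any downloads
sample_stopwords = ["how", "to", "with", "in"]
tokens = ["how", "to", "remove", "stop", "words", "with", "nltk", "library", "in", "python"]

print(remove_stop_words(tokens, sample_stopwords))
```

Note that the function assumes the stop words themselves are lowercase, which is the case for the lists used in this article.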
Sometimes you may need to add or remove words from your list of stop words.
For example, imagine you’re trying to classify food magazines based on the kinds of food they focus on. You would expect the word food (or similar words) to be mentioned a lot in every magazine, so these words would not provide valuable information.
Hence, food effectively acts as a stop word for this task, and you may want to add it to your list of stop words.
Luckily, stopwords.words('english') returns a regular Python list which we can easily modify. Keep in mind that this does not change the stop words you downloaded to your disk.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# It returns a regular Python list
english_stopwords = stopwords.words('english')

# Add a list of words
english_stopwords.extend(['food', 'meal', 'eat'])

# Add a single word
english_stopwords.append('plate')

# Remove a single word
english_stopwords.remove('not')
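An alternative to modifying the list in place is to build a new collection with set operations, which leaves the original list untouched. This is only a sketch: the helper name customize_stopwords and the tiny base list are made up for the example, and with NLTK you would pass stopwords.words('english') as the base.

```python
def customize_stopwords(base, add=(), remove=()):
    """Return a new set: base plus the words in `add`, minus the words in `remove`."""
    return (set(base) | set(add)) - set(remove)


# Hypothetical stand-in for stopwords.words('english')
base_stopwords = ["the", "is", "not", "a", "this"]

food_stopwords = customize_stopwords(
    base_stopwords,
    add=["food", "meal", "eat", "plate"],
    remove=["not"],
)

# Whitespace splitting is used here only to keep the example free of downloads
tokens = "this food is not a cheap meal".split()
print([word for word in tokens if word not in food_stopwords])
```

Keeping not out of the stop words can matter for tasks such as sentiment analysis, where negation changes the meaning of a sentence.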
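One exciting thing about NLTK’s stop words corpus is that it covers more than just English: there are stop words in 16 different languages at the time of writing. The exact number may differ depending on the version of the corpus you have downloaded.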
We can get the list of available languages and use them as shown below.
from nltk.corpus import stopwords

# Print the list of available languages
print(stopwords.fileids())

# Use any of the available languages
french_stopwords = stopwords.words('french')
spanish_stopwords = stopwords.words('spanish')
italian_stopwords = stopwords.words('italian')
Thanks for reading!