How to remove stop words using spaCy in Python

In natural language processing (NLP), stop words are commonly used words that are considered to have little or no semantic meaning and are often removed from the text during preprocessing. These words are typically function words or words that occur frequently in a language. Function words are small, common words that serve grammatical or structural functions in a sentence. They often include articles, prepositions, conjunctions, and pronouns.

Stop words include words like is, the, a, and, in, to, and so on. They are often considered noise in text analysis tasks because they carry little information about the content or the context of the text. By removing stop words, we can focus on the more meaningful words that convey the main ideas and concepts in the text. This can improve the accuracy of some NLP tasks, such as sentiment analysis and text classification, by eliminating irrelevant or redundant words.
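To make the idea concrete before bringing in a library, here is a minimal sketch in plain Python. The set TOY_STOP_WORDS and the function remove_stop_words are hypothetical names used only for illustration, the set holds just the six example words listed above, and splitting on whitespace stands in for real tokenization.

```python
# Hypothetical, hand-picked stop word set (illustration only, not a real list)
TOY_STOP_WORDS = {"is", "the", "a", "and", "in", "to"}


def remove_stop_words(text: str) -> list:
    """Return the whitespace-separated words of text that are not stop words."""
    # Compare in lowercase so that "The" and "the" are treated alike
    return [word for word in text.split() if word.lower() not in TOY_STOP_WORDS]


print(remove_stop_words("The cat is in the garden"))  # ['cat', 'garden']
```

A real stop word list is much longer, and a real tokenizer also handles punctuation and contractions, which is where a library such as spaCy helps.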

Using spaCy

spaCy is a popular open-source library for NLP in Python. It provides various functionalities for text processing, including stop word removal.

spaCy provides a default list of stop words for various languages, including English, French, German, Spanish, and more. These stop word lists are accessible through the nlp object (for example, as nlp.Defaults.stop_words) once we load the corresponding language model.
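As a rough sketch of how the default list can be inspected, the snippet below reads the English stop words from the nlp object. It assumes spaCy is installed and uses spacy.blank("en"), a choice made here so that no trained model needs to be downloaded. A pipeline loaded with spacy.load exposes the same attribute.

```python
import spacy

# A blank English pipeline carries the language defaults, including the
# stop word list, without requiring a trained model
nlp = spacy.blank("en")

# The default stop words are stored as a set of lowercase strings
stop_words = nlp.Defaults.stop_words

print(len(stop_words))           # size of the list (varies by spaCy version)
print(sorted(stop_words)[:10])   # a small alphabetical sample
print("the" in stop_words)       # True
```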

Note: The list of stop words can vary depending on the specific NLP library or application being used. Different languages and domains may have different sets of stop words.
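Because the appropriate list depends on the domain, it can be useful to adjust spaCy's defaults. The following is a minimal sketch, not a recommended configuration: the word "patient" and the decision to keep "not" are assumptions chosen purely for illustration, and a blank English pipeline is used so that no model download is needed.

```python
import spacy

nlp = spacy.blank("en")

# Mark a domain-specific word as a stop word (hypothetical choice)
nlp.vocab["patient"].is_stop = True

# Keep a negation that the default English list treats as a stop word
nlp.vocab["not"].is_stop = False

doc = nlp("the patient is not stable")
kept = [token.text for token in doc if not token.is_stop]
print(kept)
```

Setting the flag on a vocabulary entry affects only that exact string, so differently cased forms may need to be handled separately.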

Let's see spaCy in action for the English language.

Code explanation

In the code above:

Line 2: We import the spaCy library, which is a popular NLP library in Python.
Line 5: We load en_core_web_sm, an English language model provided by spaCy. This model contains linguistic annotations and trained pipelines for English text processing.
Line 8: We define a sample sentence we want to process and remove stop words from.
Line 9: We use the loaded language model, nlp, to process the text and create a doc object. The doc object represents the processed text and contains various linguistic annotations.
Line 12: We iterate over each token in the doc and check if it is a stop word or not. If the token is not a stop word, we add its text attribute to the filtered_tokens list.
Line 15: We print the list of filtered_tokens, which contains the words from the original sentence, excluding the stop words.
