Text preprocessing in NLP

Natural language processing (NLP) is a branch of artificial intelligence that enables machines to understand and interpret human language. An essential phase in NLP is text preprocessing, where raw text data is cleaned and converted into a format suited for subsequent analysis and modeling.

In this Answer, we will explore the essential text preprocessing techniques using Python and the popular NLP library, NLTK (Natural Language Toolkit).

Note: To learn about the NLTK library in more detail, refer to this Answer.

Techniques

These are some common text preprocessing techniques used in NLP:

[Figure: Techniques of text preprocessing in NLP]

1. Tokenization

Tokenization is the process of splitting text into smaller units called tokens, which can be words or sentences. It makes text data easier to work with and is often the first step in text preprocessing.

Let's demonstrate tokenization using NLTK:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Download the Punkt tokenizer models (only needed once)
nltk.download('punkt')
text = "Hello. Welcome to Educative."
# Word Tokenization
tokenized_words = word_tokenize(text)
print("Word Tokens:", tokenized_words)
# Sentence Tokenization
sent_tokens = sent_tokenize(text)
print("Sentence Tokens:", sent_tokens)

Note: To learn about tokenization using NLTK in more detail, refer to this Answer.

2. Lowercasing

Lowercasing involves converting all text to lowercase. This step helps to achieve text standardization, reducing the complexity and variations in the data.

text = "Hello. Welcome to Educative."
# Convert text to lowercase
lower_text = text.lower()
print("Lowercased Text:", lower_text)

3. Stopword removal

Stopwords are commonly used words (e.g., "a," "an," "the," "in") that add little meaning to the text and can be safely removed without sacrificing the overall context. NLTK provides a list of stopwords that we can utilize.

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Download the stopword list and tokenizer models (only needed once)
nltk.download('stopwords')
nltk.download('punkt')
sample_text = "This is a sample text with some stopwords that need to be removed."
# Break the text into tokens
tokenized_words = word_tokenize(sample_text)
# Obtain the list of English stopwords
english_stopwords = set(stopwords.words("english"))
# Exclude stopwords from the text
processed_words = [word for word in tokenized_words if word.lower() not in english_stopwords]
print("Text after Removing Stopwords:", " ".join(processed_words))

4. Special character and numeric removal

In many NLP tasks, special characters and numbers may not be relevant. Removing them can help to simplify the text data.

import re
text = "This text contains special characters like @, #, $ and numbers like 123."
# Remove special characters and numbers using regex
cleaned_text = re.sub(r'[^a-zA-Z]', ' ', text)
print("Text after Special Character and Numeric Removal:", cleaned_text)

5. Lemmatization

Lemmatization is the process of reducing words to their base or dictionary form (the lemma). Because it looks words up in a vocabulary (WordNet) rather than simply stripping suffixes, the output is a real word, which helps normalize inflected forms to a common base.

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
# Download the WordNet data and tokenizer models (only needed once)
nltk.download('wordnet')
nltk.download('punkt')
sample_text = "Lemmatization aids in the stemming of words, making analysis easier."
# Split the text into tokens
tokenized_words = word_tokenize(sample_text)
# Apply lemmatization to the words
lemma_engine = WordNetLemmatizer()
words_after_lemmatization = [lemma_engine.lemmatize(word) for word in tokenized_words]
print("Text after Lemmatization:", " ".join(words_after_lemmatization))

6. Stemming

Stemming is a more aggressive word normalization technique that reduces words to a root form by chopping off common affixes, mostly suffixes. Unlike lemmatization, the resulting stem is not guaranteed to be a valid word.

import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
# Download the tokenizer models (only needed once)
nltk.download('punkt')
sample_text = "Stemming transforms words into their root form, which can occasionally be harsher than lemmatization."
# Break the text into tokens
tokenized_words = word_tokenize(sample_text)
# Apply stemming to the words
stemming_tool = PorterStemmer()
words_after_stemming = [stemming_tool.stem(word) for word in tokenized_words]
print("Text after Stemming:", " ".join(words_after_stemming))

Conclusion

These are some of the key text preprocessing techniques used in NLP. By following these steps, we can clean and transform text into a format more amenable to analysis and modeling.
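
As a rough illustration, the techniques above can be chained into a single helper. The sketch below assumes the NLTK resources downloaded in the earlier snippets are available; the exact steps and their order should be adapted to the task at hand:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

def preprocess(text):
    # Lowercase, then replace special characters and numbers with spaces
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())
    # Split the text into word tokens
    tokens = word_tokenize(text)
    # Drop stopwords and lemmatize the remaining words
    english_stopwords = set(stopwords.words("english"))
    lemma_engine = WordNetLemmatizer()
    return [lemma_engine.lemmatize(word) for word in tokens if word not in english_stopwords]

print(preprocess("Hello. Welcome to Educative."))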

Remember that the choice of preprocessing steps may vary depending on the specific NLP task and the characteristics of the text data. It's essential to experiment and fine-tune the preprocessing steps to achieve the best results for our particular application. Happy NLP-ing!
