Natural language processing (NLP) is a branch of artificial intelligence that empowers machines to comprehend and interpret human language. An essential phase in NLP is text preprocessing, where raw text data is cleaned and converted into a format suited to subsequent analysis and modeling.
In this Answer, we will explore the essential text preprocessing techniques using Python and the popular NLP library, NLTK (Natural Language Toolkit).
Note: To learn about NLTK library in more detail, refer to this Answer.
Some common text preprocessing techniques used in NLP are tokenization, lowercasing, stopword removal, removal of special characters and numbers, lemmatization, and stemming. We will walk through each of these below.
Tokenization is the process of segmenting text into smaller units called tokens, which can be words or sentences. It makes text data easier to work with and is often the first step in text preprocessing.
Let's demonstrate tokenization using NLTK:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the tokenizer models if not already present
nltk.download('punkt')

text = "Hello. Welcome to Educative."

# Word tokenization
tokenized_words = word_tokenize(text)
print("Word Tokens:", tokenized_words)

# Sentence tokenization
sent_tokens = sent_tokenize(text)
print("Sentence Tokens:", sent_tokens)
Note: To learn about tokenization using NLTK in more detail, refer to this Answer.
Lowercasing converts all text to lowercase. This step standardizes the text and reduces variation in the data; for example, "Hello" and "hello" are treated as the same token.
text = "Hello. Welcome to Educative."# Convert text to lowercaselower_text = text.lower()print("Lowercased Text:", lower_text)
Stopwords are commonly used words (e.g., "a," "an," "the," "in") that add little meaning to the text and can be safely removed without sacrificing the overall context. NLTK provides a list of stopwords that we can utilize.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required NLTK data if not already present
nltk.download('punkt')
nltk.download('stopwords')

sample_text = "This is a sample text with some stopwords that need to be removed."

# Break the text into tokens
tokenized_words = word_tokenize(sample_text)

# Obtain the list of English stopwords
english_stopwords = set(stopwords.words("english"))

# Exclude stopwords from the text
processed_words = [word for word in tokenized_words if word.lower() not in english_stopwords]
print("Text after Removing Stopwords:", " ".join(processed_words))
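The default list can also be extended with domain-specific terms before filtering. As a small illustrative sketch (the extra words here are purely hypothetical examples, not part of NLTK's list):

# Extend the default stopword list with illustrative, domain-specific terms
custom_stopwords = set(stopwords.words("english"))
custom_stopwords.update({"sample", "stopwords"})  # hypothetical additions
filtered_words = [word for word in tokenized_words if word.lower() not in custom_stopwords]
print("Text after Custom Stopword Removal:", " ".join(filtered_words))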
In many NLP tasks, special characters and numbers may not be relevant. Removing them can help to simplify the text data.
import re

text = "This text contains special characters like @, #, $ and numbers like 123."

# Remove special characters and numbers using regex
cleaned_text = re.sub(r'[^a-zA-Z]', ' ', text)
print("Text after Special Character and Numeric Removal:", cleaned_text)
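Because each removed character is replaced with a space, the substitution can leave runs of extra whitespace. A small optional follow-up (not part of the original snippet) collapses them:

# Collapse the repeated whitespace left behind by the substitution
normalized_text = " ".join(cleaned_text.split())
print("Normalized Text:", normalized_text)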
Lemmatization is the process of reducing words to their base or root form (lemmas). It helps in normalizing words and reducing inflected forms to a common base.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required NLTK data if not already present
nltk.download('punkt')
nltk.download('wordnet')

sample_text = "Lemmatization aids in the stemming of words, making analysis easier."

# Split the text into tokens
tokenized_words = word_tokenize(sample_text)

# Apply lemmatization to each word
lemma_engine = WordNetLemmatizer()
words_after_lemmatization = [lemma_engine.lemmatize(word) for word in tokenized_words]
print("Text after Lemmatization:", " ".join(words_after_lemmatization))
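One caveat worth knowing: WordNetLemmatizer.lemmatize() treats every word as a noun unless a part-of-speech tag is passed through its pos parameter, so verbs often pass through unchanged. A minimal illustration:

from nltk.stem import WordNetLemmatizer

lemma_engine = WordNetLemmatizer()

# With the default noun POS, the verb form is left as-is
print(lemma_engine.lemmatize("running"))           # running
# Passing pos="v" tells the lemmatizer to treat the word as a verb
print(lemma_engine.lemmatize("running", pos="v"))  # run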
Stemming is a more aggressive word normalization technique that reduces words to their root form by chopping off suffixes or prefixes. Unlike lemmatization, the resulting stems may not be valid dictionary words.
import nltk
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# Download the tokenizer models if not already present
nltk.download('punkt')

sample_text = "Stemming transforms words into their root form, which can occasionally be harsher than lemmatization."

# Break the text into tokens
tokenized_words = word_tokenize(sample_text)

# Apply stemming to each word
stemming_tool = PorterStemmer()
words_after_stemming = [stemming_tool.stem(word) for word in tokenized_words]
print("Text after Stemming:", " ".join(words_after_stemming))
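To see how stemming can be harsher than lemmatization, the short sketch below compares the two on the same words; it reuses the NLTK classes introduced above and assumes the wordnet data has been downloaded:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Needed by the lemmatizer, if not already present
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stems such as "studi" are not dictionary words, while lemmas are
for word in ["studies", "running", "caring"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word))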
These are some of the key text preprocessing techniques used in NLP. By following these steps, we can clean and transform text into a format more amenable to analysis and modeling.
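To tie everything together, here is a minimal sketch that chains the techniques above into a single helper; the function name preprocess and the particular ordering of steps are illustrative choices, not the only valid pipeline:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download the required NLTK data if not already present
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

def preprocess(text):
    # Lowercase, then strip special characters and numbers
    text = re.sub(r'[^a-zA-Z]', ' ', text.lower())
    # Tokenize into words
    tokens = word_tokenize(text)
    # Drop stopwords and lemmatize what remains
    english_stopwords = set(stopwords.words("english"))
    lemma_engine = WordNetLemmatizer()
    return [lemma_engine.lemmatize(token) for token in tokens if token not in english_stopwords]

print(preprocess("This text contains special characters like @, #, $ and numbers like 123."))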
Remember that the choice of preprocessing steps may vary depending on the specific NLP task and the characteristics of the text data. It's essential to experiment and fine-tune the preprocessing steps to achieve the best results for our particular application. Happy NLP-ing!