Text Preprocessing Essentials

Learn how preprocessing transforms raw, messy text into clean, normalized data that fuels advanced NLP models.

Imagine you’re building an email spam filter, and a user sends a message with ALL CAPS, random punctuation, and maybe even an emoji or two. You can quickly spot the yelling as a human and decide whether to shrug it off or block it. But to a computer, that message is just a blob of characters—completely chaotic without further processing. It struggles to understand the true intention unless you clean things up first. That’s exactly what text preprocessing does: it tames the messy realm of human language, giving your models the clarity they need to do their job effectively.


Now, apply that same principle to today’s awe-inspiring generative AI models—the ones writing poetry, generating code, and holding fluid conversations. It’s easy to forget that behind every clever output is an enormous amount of work spent massaging raw text into a usable form. Whether you’re building a chatbot or crafting an AI that suggests new lines of code, these systems can only be as smart as the data they’re trained on. That’s why text preprocessing is crucial—no matter how advanced the model, success still begins with clean, consistent input so you can reap reliable, high-quality results on the other side.

The messy world of text

Suppose you’re handed a giant folder of user-generated content—product reviews, live stream chat logs, or even casual Slack conversations within a company. You open one document and see something like the following:

OMG!!!!! BEST. PRODUCT. EVER?!?!?? #mustbuy #CrazyDeal <3 <3

Meh… Not sure if it’s worth it??? $$ $$ 

holy guacamole I luv this soooo much 10/10 WOULD RECCOMEND

pls gimme code for sortin arr?? #help IM DESPERATE .!!. meltdown ??? TL;DR help

As a human reader, you can probably get the gist of each statement. But imagine for a moment you’re a computer program trying to read this data: without any cleanup, all those extra symbols and stylistic flourishes quickly become noise, confusing any machine learning or AI algorithm. So, how do we tame messy text in a way machines can understand?

We’ll explore three essential techniques that transform raw, chaotic text into digestible input for NLP: tokenization, stemming, and lemmatization. We’ll also see how these steps emerged from real-world needs—like improving search engines—and evolved into indispensable tools for today’s cutting-edge foundation models. By understanding their roots and rationale, you’ll discover why these seemingly simple preprocessing tasks power some of the most advanced AI applications in use today.

Why does text preprocessing matter?

Back in the 1960s and 1970s, early search systems faced a huge challenge: all their text data was riddled with inconsistent spacing, random punctuation, and a seemingly endless array of word variations. Researchers quickly realized that raw, messy text didn’t lend itself to straightforward keyword matching. Splitting text into smaller segments, stripping away noise, and normalizing word forms became must-have techniques to make these early systems usable.

Fast-forward to today, and those same foundational ideas have evolved through information retrieval and NLP to enable everything from text classification to sentiment analysis, eventually paving the way for modern generative AI (GenAI). Even advanced language models, like ChatGPT, rely on these basic preprocessing steps, tokenizing and normalizing text before generating a coherent reply. Think of it like cleaning a camera lens: no matter how sophisticated the camera (or the AI model), you won’t capture good results if the lens is cluttered. Preprocessing ensures our lens on language is clear, keeping simple search queries and next-generation AI conversations running smoothly.

Educative byte: While early systems required extensive preprocessing, many modern transformer-based models are designed to handle relatively raw text. However, effective tokenization remains a critical first step—even for these models—to ensure consistency and to manage vocabulary efficiently.

What is tokenization?

Tokenization is the process of splitting raw text into smaller units called tokens—these might be words, subwords, or individual characters. Why bother? Imagine you’re handed a dense, unformatted sentence without spaces or punctuation: "GenerativeAIisfascinatingandisthefuture". Without tokenization, deciphering meaningful segments becomes nearly impossible. Humans can intuitively parse this into generative AI is fascinating and is the future, but machines require explicit instructions to recognize word boundaries. Tokenization bridges this gap, enabling machines to identify and separate individual words or meaningful subunits within the text.
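To see the gap concretely, here’s a tiny sketch using nothing but Python’s built-in str.split(): on the space-free string above, a naive whitespace split returns the whole blob as a single "token," which is exactly why tokenizers need smarter rules.

# Naive whitespace splitting fails when word boundaries aren't marked by spaces.
blob = "GenerativeAIisfascinatingandisthefuture"
print(blob.split())    # ['GenerativeAIisfascinatingandisthefuture'] -- one unhelpful chunk

spaced = "Generative AI is fascinating and is the future"
print(spaced.split())  # ['Generative', 'AI', 'is', 'fascinating', 'and', 'is', 'the', 'future']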

[Image: Tokenization]

In reality, tokenization results may vary depending on the tokenizer used. The above image is for demonstration purposes only. Moreover, languages vary widely in their structure and word boundaries. For instance:

  • English: Spaces separate words, making basic tokenization relatively straightforward.

  • Chinese/Japanese: Words often aren’t separated by spaces, requiring statistical models or dictionary-based approaches to segment text.

  • Social media and code: Hashtags (#MachineLearning), contractions (can't → can + not), and code snippets (int myVar = 5;) all require specialized tokenization strategies.

Effective tokenization must account for these linguistic nuances to accurately parse and process text across different languages, ensuring that NLP models remain versatile and applicable in diverse linguistic contexts.
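As a rough illustration of such a specialized strategy, here is a minimal sketch using Python’s built-in re module. The pattern is an assumption made purely for this example (keep hashtags and contractions intact, split off other punctuation), not a production-grade rule set.

import re

# Illustrative pattern: hashtags, words with an optional contraction, then any other symbol.
pattern = r"#\w+|\w+(?:'\w+)?|[^\w\s]"

text = "I can't believe #MachineLearning is this fun!!!"
print(re.findall(pattern, text))
# ['I', "can't", 'believe', '#MachineLearning', 'is', 'this', 'fun', '!', '!', '!']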

Tokenization can be approached in various ways, each suited to different applications and linguistic complexities:

  • Word tokenization splits text into individual words. For example, the sentence "Generative AI is fascinating." becomes ["Generative", "AI", "is", "fascinating", "."]. This method is straightforward but may struggle with compound words, contractions (e.g., "don't" might be split into ["don", "'t"] or handled specially), or hyphenated words. Advanced tokenizers use regular expressions or statistical models to manage these cases.

  • Subword tokenization breaks words into smaller units, which is particularly useful for handling unknown or rare words. For instance, "unhappiness" might be tokenized into ["un", "happiness"]. Take the word “tokenization.” A subword tokenizer might split it into ["token", "ization"], letting the model reuse “token” in “tokenizer” or “tokenized.” This is how models like GPT-4o handle obscure terms like “supercalifragilisticexpialidocious” without breaking a sweat.

Note: Modern large language models often employ subword tokenization methods like byte pair encoding (BPE) or SentencePiece to handle massive vocabularies efficiently. Tools like tiktoken are specifically designed for GPT-based models to keep track of token counts and ensure prompts fit within token limits. Context window sizes are evolving rapidly! We will take a closer look at these advanced tokenizers later in the course.

  • Character tokenization splits text into individual characters, such as
    ["G", "e", "n", "e", "r", "a", "t", "i", "v", "e", ...]. While this method captures every detail, it often results in longer sequences that can be computationally intensive for models to process.

Generative AI models, like GPT, rely heavily on tokenized input to generate coherent and contextually relevant text. By breaking down text into tokens, these models can better understand and manipulate language, enabling them to produce human-like responses. Effective tokenization ensures that generative AI systems can handle a vast array of linguistic inputs—from simple sentences to complex technical jargon—maintaining accuracy and fluency in their outputs.

Let’s implement a basic word tokenizer using Python. This example will split a sentence into words and punctuation marks, demonstrating how tokenization structures raw text.

def simple_tokenize(text):
    tokens = []
    current_word = ""
    for char in text:
        if char.isalnum():
            current_word += char
        else:
            if current_word != "":
                tokens.append(current_word)  # Append the accumulated word.
                current_word = ""
            if char.strip() != "":  # Ignore whitespace.
                tokens.append(char)  # Append punctuation or other non-alphanumeric characters.
    if current_word != "":
        tokens.append(current_word)  # Append any remaining word.
    return tokens

# Example usage
sentence = "Generative AI is fascinating!"
tokens = simple_tokenize(sentence)
print(tokens)

This simple function iterates through each character in the input text, building words by collecting alphanumeric characters and separating out punctuation as individual tokens. While rudimentary, this approach highlights the fundamental process of tokenization, providing a clear starting point for more advanced techniques.

While the provided examples illustrate basic tokenization, real-world applications often utilize advanced libraries like NLTK, spaCy, or Hugging Face’s Tokenizers for more efficient and sophisticated tokenization processes. These libraries handle a variety of languages and complex tokenization rules, making them indispensable for large-scale NLP projects.
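For instance, here is a quick sketch with NLTK’s word_tokenize, assuming NLTK is installed and its "punkt" tokenizer data has been downloaded (newer NLTK versions may also ask for "punkt_tab"). It shows how a mature tokenizer handles punctuation and contractions:

import nltk
nltk.download("punkt")  # one-time download of the tokenizer data
from nltk.tokenize import word_tokenize

print(word_tokenize("Generative AI is fascinating! Don't you agree?"))
# Typical output: ['Generative', 'AI', 'is', 'fascinating', '!', 'Do', "n't", 'you', 'agree', '?']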

Educative byte: GPT‑4o can handle around 128,000 input tokens at a time. Ever wonder how it measures input size? You guessed it—by counting tokens! Rather than tracking characters or raw words, the model breaks your text into these smaller blocks it can process, ensuring it stays within that massive but finite 128k‑token envelope.
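If you want to see this counting in action, here’s a brief sketch with the tiktoken library (assuming it is installed, e.g., via pip install tiktoken); the exact count depends on which encoding you load.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # an encoding used by several GPT models
text = "Generative AI is fascinating and is the future."
token_ids = enc.encode(text)

print(len(token_ids))         # the number of tokens the model would "see"
print(enc.decode(token_ids))  # decoding the IDs recovers the original text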

While tokenization effectively breaks down text into manageable units, it does not address the variability and complexity inherent in human language. Words frequently take on different forms—pluralization, verb tenses, comparative, and superlative degrees—all of which can dilute the effectiveness of NLP models if treated as unrelated entities. For example, “run,” “running,” and “ran” refer to the same core concept but would be seen as distinct tokens if only tokenization was used. This fragmentation can lead to inefficiencies, forcing models to learn redundant patterns for the same idea.

Researchers recognized this problem early on, realizing that a consistent way to treat word variants was needed. This led to two main approaches for word normalization: stemming and lemmatization.

What is stemming?

Stemming is a rule-based process that truncates words by removing common prefixes or suffixes. It’s quick and computationally simple, making it popular for tasks like document classification and search engine indexing. By collapsing words like “cats” and “cat” into the common stem “cat,” or “running” and “runs” into “run,” stemming consolidates morphological variants so models can learn a single representation. This drastically reduces vocabulary size in classical NLP pipelines, which can improve speed and accuracy.

Established algorithms such as the Porter stemmer or Snowball stemmer have been widely used in NLP for decades. They represent more refined rule sets than our simple example but still operate on similar principles. However, you don’t need to master their internals to understand how GenAI works; a passing familiarity is enough for now.
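If you’d like a glimpse anyway, here is a short sketch using NLTK’s PorterStemmer (assuming NLTK is installed); notice that even a polished rule set produces non-words like "happili".

from nltk.stem import PorterStemmer

porter = PorterStemmer()
for word in ["running", "happily", "tried", "cats", "studies"]:
    print(word, "->", porter.stem(word))
# Typical output: run, happili, tri, cat, studi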

[Image: Stemming]

We’ll create a basic stemmer that removes common suffixes. This example demonstrates how stemming reduces words to their root forms, albeit in a simplistic manner.

def simple_stem(word):
    suffixes = ["ing", "ly", "ed", "ious", "ies", "ive", "es", "s", "ment"]
    for suffix in suffixes:
        if word.endswith(suffix):
            return word[:-len(suffix)]  # Remove the matched suffix.
    return word

# Example usage
words = ["running", "happily", "tried", "faster", "cats"]
stemmed_words = [simple_stem(word) for word in words]
print("Stemmed Words:", stemmed_words)

This simple stemmer removes suffixes but doesn’t account for all linguistic nuances. For instance, “faster” remains “faster” because it doesn’t match any suffix exactly. Also, “happily” becomes “happi,” “tried” becomes “tri,” and “running” becomes “runn” rather than “run,” reflecting the crude but efficient nature of stemming. This highlights the limitations of basic stemming approaches, emphasizing the need for more sophisticated methods in real-world applications.

What is lemmatization?

Lemmatization takes a more sophisticated route, mapping words to their base or dictionary form (a lemma). Unlike stemming, lemmatization typically requires knowledge of a word’s part of speech and may rely on morphological analyzers or lexical databases. Lemmatization has deep origins in computational linguistics and classical philology, where scholars created tools to handle inflected forms of Latin, Greek, and other languages. As NLP matured, this linguistic know-how was integrated into text-processing pipelines for more precise normalization than stemming could offer.

[Image: Lemmatization]

Whereas a stemmer might turn “better” into “bett,” a good lemmatizer recognizes that “better” can be mapped to “good.” Similarly, “running” may become “run,” and “ran” may also become “run.” This yields more linguistically accurate groupings of word variants—crucial in tasks like sentiment analysis, where subtle changes in meaning matter.

We’ll also create a very basic lemmatizer using a predefined dictionary for irregular forms. This approach demonstrates how lemmatization can accurately reduce words to their lemmas based on known irregularities.

def simple_lemmatize(word):
    # A minimal dictionary for known irregular forms.
    irregular_lemmas = {
        "running": "run",
        "happily": "happy",
        "ran": "run",
        "better": "good",
        "faster": "fast",
        "cats": "cat",
        "dogs": "dog",
        "are": "be",
        "is": "be",
        "have": "have"
    }
    return irregular_lemmas.get(word, word)

# Example usage
words = ["running", "happily", "ran", "better", "faster", "cats"]
lemmatized_words = [simple_lemmatize(word) for word in words]
print("Lemmatized Words:", lemmatized_words)

This simple lemmatizer only handles a few irregular forms and doesn’t cover the full complexity of English morphology. It illustrates the concept of lemmatization by accurately reducing known irregular words, highlighting the difference between stemming and lemmatization in handling linguistic nuances.
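In practice, you would reach for a resource-backed lemmatizer instead. Here is a brief sketch using NLTK’s WordNetLemmatizer (assuming NLTK and its "wordnet" data are installed, e.g., via nltk.download("wordnet")), which also shows how the part-of-speech tag changes the result.

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))  # run
print(lemmatizer.lemmatize("better", pos="a"))   # good
print(lemmatizer.lemmatize("cats"))              # cat (the default POS is noun)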

Which is better: stemming or lemmatization?

As we can see, both stemming and lemmatization aim to reduce words to their base or root forms, unifying variations like “run,” “runs,” and “ran.” However, they achieve this goal through different methods and with varying degrees of precision.

Stemming employs crude heuristics to truncate words, often removing common suffixes. For example, a stemmer might convert “running” to “run” or “happily” to “happi.” This method is fast and straightforward, making it ideal for applications where processing speed is critical and some loss of linguistic accuracy is acceptable. Stemming is widely used in search engines and text classification tasks where the primary goal is to group related words together efficiently.

Lemmatization, in contrast, leverages linguistic knowledge and dictionaries to accurately reduce words to their lemma (dictionary form). For instance, “better” is correctly lemmatized to “good,” and “running” to “run.” Lemmatization is more accurate and context-aware, ensuring that the base forms are meaningful and linguistically correct. This makes lemmatization particularly valuable in tasks like sentiment analysis, machine translation, and any application requiring a deep understanding of language nuances.

So, which is better? The answer depends on your specific needs:

  • Choose stemming when you need speed and can tolerate some inaccuracies. It’s suitable for large-scale applications like indexing documents for search engines.

  • Choose lemmatization when you require accuracy and semantic correctness. It’s ideal for tasks that benefit from understanding the precise meaning of words.

For instance, search engines like Elasticsearch often use stemming to index documents quickly, ensuring that queries like “run,” “running,” and “ran” retrieve relevant results. In contrast, applications requiring nuanced language understanding—such as sentiment analysis in customer reviews—benefit more from lemmatization to accurately capture the sentiment expressed by different word forms.

Additionally, while stemming and lemmatization are generally used as alternative approaches, there are rare scenarios where combining them can enhance preprocessing. For instance, you might apply stemming to quickly reduce words to a rough base form and then use lemmatization to refine these stems into accurate lemmas. However, this combined approach adds computational complexity and is uncommon in practice. Typically, selecting one method based on your task’s requirements is both sufficient and more efficient.
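To see the trade-off on the same inputs, here is a short side-by-side sketch using NLTK’s stemmer and lemmatizer (assuming NLTK and its WordNet data are installed; outputs may vary slightly across versions):

from nltk.stem import PorterStemmer, WordNetLemmatizer

porter = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word, pos in [("running", "v"), ("better", "a"), ("studies", "n")]:
    stem = porter.stem(word)
    lemma = lemmatizer.lemmatize(word, pos=pos)
    print(f"{word:10} stem: {stem:10} lemma: {lemma}")
# running    stem: run        lemma: run
# better     stem: better     lemma: good
# studies    stem: studi      lemma: study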

Here’s a fun linguistic puzzle that can trip up tokenization, stemming, and lemmatization:

“The fisherman painted a bass on the wall.”
“The fisherman listened to the deep bass of the waves.”

Here, “bass” has two entirely different meanings—one as a type of fish and the other as a low-pitched sound—demonstrating how NLP models can struggle with context if preprocessing techniques don’t account for word sense disambiguation. One modern way to address this challenge is through contextual embeddings, which capture the meaning of words based on their context within a sentence. For example, contextual embeddings help the model distinguish between the two meanings of “bass” based on surrounding words. (We will look at them in detail later in the course.)


It’s important to recognize that while stemming is often characterized as a crude approach and lemmatization as a more precise one, both methods have their merits and limitations. Stemming’s simplicity makes it effective for tasks like search indexing, even if it occasionally produces imperfect results. Conversely, although lemmatization generally provides a higher degree of accuracy, it is not infallible and depends on the quality of its linguistic resources. Ultimately, the choice between stemming and lemmatization should be guided by the specific requirements and trade-offs of your application.

What are some other preprocessing techniques?

In addition to tokenization, stemming, and lemmatization, several other preprocessing techniques enhance text data quality for NLP models:

  • Lowercasing standardizes text by converting all words to lowercase, reducing vocabulary size and improving model efficiency.

Educative byte: There are times when “clean” text can also backfire. For example, lowercasing seems harmless, right? Not always. Take “Apple released the iPhone” vs. “I ate an apple.” Lowercasing both to “apple” erases the difference between a fruit and a trillion-dollar company. Similarly, stripping stop words like “not” from “not bad” flips the sentiment from neutral to positive!

  • Removing stop words eliminates common but semantically insignificant words (e.g., "the," "is"), allowing models to focus on meaningful content.

  • Stripping punctuation removes unnecessary symbols that add complexity without contributing to meaning in tasks like text classification.

  • Handling special characters and numbers ensures that only relevant elements are retained, depending on the task (e.g., keeping numbers for sentiment analysis but removing them for general text processing).

  • Handling contractions expands shortened forms (e.g., "don't" to "do not"), which can improve understanding.

  • Correcting misspellings automatically fixes typos, ensuring consistency in the dataset.

  • Dealing with abbreviations and acronyms expands or standardizes them (e.g., "AI" to "Artificial Intelligence"), which can enhance clarity.

Below is an example of simple Python code (using only built-in functions) that demonstrates these preprocessing steps one by one:

# Sample text containing various cases
text = "Apple released the iPhone! I didn't know that Apple's announcement would shock everyone. Don't you think it's amazing?"
print("Original Text:")
print(text)
print("-" * 100)

# 1. Lowercasing: Convert all text to lowercase
lower_text = text.lower()
print("After Lowercasing:")
print(lower_text)
print("-" * 100)

# 2. Tokenization: Split text into words (this simple approach splits on whitespace)
tokens = lower_text.split()
print("After Tokenization:")
print(tokens)
print("-" * 100)

# 3. Stripping Punctuation: Remove punctuation from each token
# Define a set of punctuation characters
punctuations = '.,!?\'":;()'
tokens = [token.strip(punctuations) for token in tokens]
print("After Removing Punctuation:")
print(tokens)
print("-" * 100)

# 4. Removing Stop Words: Filter out common, semantically insignificant words
stop_words = ['the', 'is', 'at', 'on', 'and', 'a', 'an', 'of', 'that', 'would', 'you', 'it']
tokens = [token for token in tokens if token not in stop_words]
print("After Removing Stop Words:")
print(tokens)
print("-" * 100)

# 5. Expanding Contractions: Replace contractions with their expanded forms
# Note: This is a simple dictionary for demonstration
contractions = {
    "didn't": "did not",
    "don't": "do not",
    "it's": "it is",
    "i'm": "i am",
    "i've": "i have",
    "apple's": "apple has"
}
expanded_tokens = []
for token in tokens:
    if token in contractions:
        # Split the expanded form to keep tokens consistent
        expanded_tokens.extend(contractions[token].split())
    else:
        expanded_tokens.append(token)
tokens = expanded_tokens
print("After Expanding Contractions:")
print(tokens)
print("-" * 100)

# 6. Handling Special Characters and Numbers:
# For this example, remove tokens that are purely numeric.
tokens = [token for token in tokens if not token.isdigit()]
print("After Handling Numbers:")
print(tokens)
print("-" * 100)

# 7. Correcting Misspellings:
# A very basic approach using a predefined dictionary of common corrections.
corrections = {
    "iphon": "iphone",  # Example: if a typo occurred
    # add more common misspellings as needed
}
tokens = [corrections.get(token, token) for token in tokens]
print("After Correcting Misspellings:")
print(tokens)
print("-" * 100)

# 8. Dealing with Abbreviations and Acronyms:
# Expand or standardize abbreviations using a simple mapping.
abbreviations = {
    "ai": "artificial intelligence",
    # add additional abbreviation mappings as needed
}
tokens = [abbreviations.get(token, token) for token in tokens]
print("After Expanding Abbreviations:")
print(tokens)
print("-" * 100)

# Final preprocessed tokens
print("Final Preprocessed Tokens:")
print(tokens)

These steps refine input data, reducing noise and inconsistencies and improving generative AI models’ accuracy and coherence.

Beyond messy punctuation and unpredictable slang, real-world data often carries embedded biases—sometimes in subtle ways. Take AI-generated images of people, for example: they tend to show individuals using their right hand simply because right-handed photos are more common in training sets. The same goes for text: if most of your data comes from English-speaking sources, your model might struggle with regional dialects or languages that aren’t well-represented. Researchers address this by proactively balancing datasets—collecting images or text from diverse sources, filtering out harmful or unrepresentative examples, and applying techniques like data augmentation. While this can greatly reduce bias, there’s no fix-all solution: any real-world dataset will inevitably reflect certain imbalances found in society itself.


Organizations like CommonCrawl and various academic consortia work tirelessly to gather broad, inclusive text corpora and image collections, which developers can use to build more equitable AI models. Even so, no amount of preprocessing guarantees a perfectly unbiased dataset. The best practice is an ongoing cycle: gather data from diverse users and regions, use automated and manual filters to weed out glaring biases, and continually monitor model outputs to catch new or unforeseen issues. By recognizing that bias can creep in at every stage—from initial data collection to preprocessing and training—engineers and data scientists can at least minimize its impact, ensuring more balanced and fair outcomes for all users.

Proper text preprocessing is crucial for effective NLP applications, forming the foundation for techniques like bag of words (BoW), TF-IDF, and word embeddings, as well as more sophisticated architectures such as RNNs, Transformers, and BERT-based models.

Our upcoming lessons will dive into these methods, along with morphological analysis, and show how well-preprocessed text leads to better performance and more reliable outcomes in everything from sequence modeling to text generation. These classical techniques set the stage for the advanced sequence models and transformers that power modern AI systems like GPT. Happy learning, and get ready to explore the next layer of the NLP evolution!
