Introduction to N-Grams

Learn what n-grams are and how to implement bigrams using Python.

Overview

N-grams in text preprocessing are sequences of n items, such as words or characters, extracted from text data. They address the challenge of capturing linguistic relationships and context: by extracting sequences of adjacent items, n-grams enable models to learn associations between elements with deeper context. This is particularly important for sentiment analysis tasks, where capturing phrases such as “not good” is crucial for understanding negation. Additional benefits of n-grams include enhancing text classification by considering the co-occurrence of words and improving the accuracy of machine translation by modeling word sequences. Here are common types of n-grams represented in a table:

Common Types of N-Grams

| N-Gram Type | Description | Example | Use in Text Preprocessing |
|---|---|---|---|
| Unigrams (1-grams) | Single words or characters in a sequence | “The”, “dog”, “is”, “sleeping” | We use them as basic features for simple tasks or to analyze word frequency. |
| Bigrams (2-grams) | Pairs of adjacent words or characters in a sequence | “The dog”, “dog is”, “is sleeping” | We use them to capture immediate word relationships. They’re useful for tasks like sentiment analysis and language modeling. |
| Trigrams (3-grams) | Groups of three adjacent words or characters in a sequence | “The dog is”, “dog is sleeping” | These offer more context than bigrams. They’re useful for tasks like language modeling and certain machine translation models. |
| Quadgrams (4-grams) | Groups of four adjacent words or characters in a sequence | “The dog is sleeping”, “dog is sleeping on” | They capture longer contextual patterns. They’re helpful in scenarios like certain machine translation tasks. |
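To make this concrete, here’s a minimal sketch of bigram extraction in plain Python. The `tokenize` helper is a naive whitespace tokenizer defined here purely for illustration; a real pipeline would typically use a proper tokenizer from a library such as NLTK or spaCy.

```python
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer (illustrative only)."""
    return text.lower().split()

def extract_ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The product was not good enough"
tokens = tokenize(text)

bigrams = extract_ngrams(tokens, 2)
print(bigrams)
# [('the', 'product'), ('product', 'was'), ('was', 'not'),
#  ('not', 'good'), ('good', 'enough')]

# Counting n-gram frequencies is a common first step for
# language modeling or feature extraction.
counts = Counter(bigrams)
print(counts.most_common(3))
```

Notice that the bigram `('not', 'good')` is captured as a single unit, which is exactly what makes bigrams useful for negation-sensitive tasks like sentiment analysis.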

Limitations

While n-grams offer clear benefits, they also have limitations:

  • As the length of n-grams increases, the number of possible combinations grows exponentially, leading to high-dimensional feature spaces and, in turn, increased memory and computational requirements. We can mitigate this limitation with feature selection techniques that retain only the most informative n-grams (see the sketch after this list).

  • While n-grams are useful for capturing local patterns of language, they often fail to capture broader contextual information. For instance, consider the trigram “not good enough.” On its own, this trigram might suggest a negative sentiment. However, without considering the surrounding context, it’s challenging to determine the sentiment accurately. It could be a sentence like “The product was not good enough, but the customer service was excellent.” In this case, the overall sentiment is positive, but the trigram alone could lead to a misleadingly negative classification.
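As a sketch of the dimensionality mitigation mentioned in the first point, here’s one plausible approach using scikit-learn’s CountVectorizer. Capping the vocabulary with max_features is a simple frequency-based form of feature selection; the corpus and the cap of 10 features below are illustrative assumptions, not fixed recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The dog is sleeping on the porch",
    "The product was not good enough",
    "The customer service was excellent",
]

# ngram_range=(1, 2) extracts unigrams and bigrams;
# max_features=10 keeps only the 10 most frequent n-grams,
# discarding the rest of the (potentially huge) feature space.
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=10)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().shape)  # (3 documents, 10 n-gram features)
```

When labels are available, more principled selection methods (for example, chi-squared scoring with scikit-learn’s SelectKBest) can replace the raw frequency cutoff.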
