Introduction to N-Grams
Learn what n-grams are and how to implement bigrams using Python.
Overview
N-grams in text preprocessing are sequences of n adjacent items, typically words or characters, taken from a text.
Common Types of N-Grams
N-Gram Type | Description | Example | Use in Text Preprocessing |
Unigrams (1-grams) | Single words or characters in a sequence | “The” “dog” “is” “sleeping” | We use them as basic features for simple tasks or to analyze word frequency. |
Bigrams (2-grams) | Pairs of adjacent words or characters in a sequence | “The dog” “dog is” “is sleeping” | We use them to capture immediate word relationships. They’re useful for tasks like sentiment analysis and language modeling. |
Trigrams (3-grams) | Groups of three adjacent words or characters in a sequence | “The dog is” “dog is sleeping” | These offer more context than bigrams. They’re useful for tasks like language modeling and certain machine translation models. |
Quadgrams (4-grams) | Groups of four adjacent words or characters in a sequence | “The dog is sleeping” | They capture longer contextual patterns. They’re helpful in scenarios like certain machine translation tasks. |
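The rows of the table can be reproduced with a short sliding-window function. The following is a minimal sketch in plain Python, not a reference implementation: the helper name `ngrams` and the whitespace tokenization are assumptions made for illustration, and real pipelines usually apply a proper tokenizer first.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a sequence of tokens."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n across the sequence, one position at a time.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The dog is sleeping"
words = sentence.split()  # naive whitespace tokenization (an assumption)

# Word-level unigrams, bigrams, trigrams, and quadgrams for the example sentence.
for n in range(1, 5):
    print(n, [" ".join(gram) for gram in ngrams(words, n)])

# The same function works on characters, because a string is a sequence too.
char_bigrams = ["".join(gram) for gram in ngrams("dog", 2)]
print(char_bigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so the four-word example sentence produces three bigrams, two trigrams, and a single quadgram.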
Limitations
While using n-grams offers benefits, there are also limitations:
As n increases, the number of possible n-grams grows exponentially (up to V^n for a vocabulary of V distinct tokens), leading to high-dimensional, sparse feature spaces. This increases memory and computational requirements. However, we can mitigate this limitation by using feature selection techniques that retain only the most informative n-grams.
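To make the growth and one possible remedy concrete, here is a small sketch under stated assumptions. The three-sentence corpus is invented for illustration, and the helpers `count_ngrams` and `select_frequent` are hypothetical names. A minimum-count threshold stands in for "most informative" here; it is only one simple criterion, and statistical measures such as chi-squared or mutual information are common alternatives.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams (as tuples) found in a sequence of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy corpus, invented for illustration only.
corpus = [
    "the dog is sleeping",
    "the dog is barking",
    "the cat is sleeping",
]


def count_ngrams(docs, n):
    """Count how often each word-level n-gram occurs across all documents."""
    counts = Counter()
    for doc in docs:
        counts.update(ngrams(doc.split(), n))
    return counts


def select_frequent(counts, min_count=2):
    """Keep only n-grams seen at least min_count times (a simple frequency filter)."""
    return {gram for gram, count in counts.items() if count >= min_count}


# Compare the theoretical upper bound V ** n with the n-grams actually observed.
vocab_size = len({word for doc in corpus for word in doc.split()})
for n in (1, 2, 3):
    observed = len(count_ngrams(corpus, n))
    print(f"n={n}: at most {vocab_size ** n} possible, {observed} observed")

kept_bigrams = select_frequent(count_ngrams(corpus, 2))
print(sorted(kept_bigrams))
```

The upper bound grows as V^n, while the number of n-grams actually observed is limited by the size of the corpus, which is why n-gram feature spaces are sparse. Raising `min_count` shrinks the feature set further, at the cost of discarding rare n-grams that may still carry signal.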
While n-grams are useful for capturing local patterns of language, they often fail to capture broader contextual information. For instance, consider the trigram “not good enough.” On its own, this trigram suggests a negative sentiment. However, without the surrounding context, it’s difficult to determine the sentiment of the full sentence accurately. The trigram could appear in a sentence like “The product was not good enough, but the customer service was excellent.” In this case, the overall sentiment is positive, but the trigram alone can lead to a ...