Introduction to N-Grams

Learn what n-grams are and how to implement bigrams using Python.

Overview

N-grams in text preprocessing are sequences of n items, such as words or characters, extracted from text data. They address the challenge of capturing linguistic relationships and context: by extracting sequences of adjacent items, n-grams enable models to learn associations between elements with deeper context. This is particularly important for sentiment analysis tasks, where capturing phrases such as “not good” is crucial for understanding negation. Additional benefits of n-grams include enhancing text classification by considering the co-occurrence of words and improving the accuracy of machine translation by modeling word sequences. Here are common types of n-grams represented in a table:

Common Types of N-Grams

| N-Gram Type | Description | Example | Use in Text Preprocessing |
|---|---|---|---|
| Unigrams (1-grams) | Single words or characters in a sequence | “The”, “dog”, “is”, “sleeping” | We use them as basic features for simple tasks or to analyze word frequency. |
| Bigrams (2-grams) | Pairs of adjacent words or characters in a sequence | “The dog”, “dog is”, “is sleeping” | We use them to capture immediate word relationships. They’re useful for tasks like sentiment analysis and language modeling. |
| Trigrams (3-grams) | Groups of three adjacent words or characters in a sequence | “The dog is”, “dog is sleeping” | These offer more context than bigrams. They’re useful for tasks like language modeling and certain machine translation models. |
| Quadgrams (4-grams) | Groups of four adjacent words or characters in a sequence | “The dog is sleeping”, “dog is sleeping on” | They capture longer contextual patterns. They’re helpful in scenarios like certain machine translation tasks. |
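To make this concrete, here’s a minimal sketch of bigram extraction in plain Python. The `tokenize` helper is a naive whitespace tokenizer defined here purely for illustration; a real pipeline would typically use a proper tokenizer from a library such as NLTK or spaCy.

```python
from collections import Counter

def tokenize(text):
    """Naive whitespace tokenizer (illustrative only)."""
    return text.lower().split()

def extract_ngrams(tokens, n):
    """Slide a window of size n over the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "The product was not good enough"
tokens = tokenize(text)

bigrams = extract_ngrams(tokens, 2)
print(bigrams)
# [('the', 'product'), ('product', 'was'), ('was', 'not'),
#  ('not', 'good'), ('good', 'enough')]

# Counting n-gram frequencies is a common first step for
# language modeling or feature extraction.
counts = Counter(bigrams)
print(counts.most_common(3))
```

Notice that the bigram `('not', 'good')` is captured as a single unit, which is exactly what makes bigrams useful for negation-sensitive tasks like sentiment analysis.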

Limitations

While n-grams offer clear benefits, they also have limitations:

  • As the length of n-grams increases, the number of possible combinations grows exponentially, leading to high-dimensional feature spaces and, in turn, increased memory and computational requirements. We can mitigate this limitation with feature selection techniques that retain only the most informative n-grams (see the sketch after this list).

  • While n-grams are useful for capturing local patterns of language, they often fail to capture broader contextual information. For instance, consider the trigram “not good enough.” On its own, this trigram might suggest a negative sentiment. However, without considering the surrounding context, it’s challenging to determine the sentiment accurately. It could be a sentence like “The product was not good enough, but the customer service was excellent.” In this case, the overall sentiment is positive, but the trigram alone could lead to a misleadingly negative classification.
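As a sketch of the dimensionality mitigation mentioned in the first point, here’s one plausible approach using scikit-learn’s CountVectorizer. Capping the vocabulary with max_features is a simple frequency-based form of feature selection; the corpus and the cap of 10 features below are illustrative assumptions, not fixed recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The dog is sleeping on the porch",
    "The product was not good enough",
    "The customer service was excellent",
]

# ngram_range=(1, 2) extracts unigrams and bigrams;
# max_features=10 keeps only the 10 most frequent n-grams,
# discarding the rest of the (potentially huge) feature space.
vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=10)
X = vectorizer.fit_transform(corpus)

print(vectorizer.get_feature_names_out())
print(X.toarray().shape)  # (3 documents, 10 n-gram features)
```

When labels are available, more principled selection methods (for example, chi-squared scoring with scikit-learn’s SelectKBest) can replace the raw frequency cutoff.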
