What is an n-gram representation?

An n-gram is a contiguous sequence of n words, symbols, or tokens, that is, a group of adjacent items in a document. The n-gram representation of a text is the list of all such sequences. n-grams are relevant in natural language processing (NLP) tasks, where we deal with textual data.

Here, n is a positive integer: 1, 2, 3, 4, and so on.
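
Because the items can be words or individual characters, the same idea applies at either level. The following is a minimal sketch, assuming a hypothetical helper named sliding_ngrams() and the sample word "gram", neither of which is part of the code later in this answer:

```python
def sliding_ngrams(items, n):
    """Return every run of n adjacent items as a tuple."""
    # A window starting at index i covers items[i], ..., items[i + n - 1].
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


# Word-level bigrams: the items are whitespace-separated tokens.
word_bigrams = sliding_ngrams("the best platform".split(), 2)

# Character-level bigrams: the items are the characters of a single word.
char_bigrams = sliding_ngrams(list("gram"), 2)

print(word_bigrams)  # [('the', 'best'), ('best', 'platform')]
print(char_bigrams)  # [('g', 'r'), ('r', 'a'), ('a', 'm')]
```

If the sequence has fewer than n items, the helper returns an empty list.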

Depending on the value of n, n-grams have the following names:

  1. Unigram
  2. Bigram
  3. Trigram
  4. n-gram

Unigram

Unigrams are a type of n-gram where the value of n is 1. Unigram means taking only one word or token at a time.

Example:

text = “Educative is the best platform”

The unigrams for the above text are as follows:

[“Educative”, “is”, “the”, “best”, “platform”]

Bigram

Bigrams are a type of n-gram where the value of n is 2. Bigram means taking two words or tokens at a time.

Example:

text = “Educative is the best platform”

The bigrams for the above text are as follows:

[“Educative is”, “is the”, “the best”, “best platform”]

Trigram

Trigrams are a type of n-gram where the value of n is 3. Trigram means taking three words or tokens at a time.

Example:

text = “Educative is the best platform”

The trigrams for the above text are as follows:

[“Educative is the”, “is the best”, “the best platform”]

n-gram

n-grams can be defined for any given value of n.

Let us consider n to be 4. This means taking four words or tokens at a time.

Example:

text = “Educative is the best platform”

The 4-grams for the above text are as follows:

[“Educative is the best”, “is the best platform”]
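
The examples above follow a pattern: a text with N tokens yields N - n + 1 n-grams when n is at most N, and none otherwise. The sketch below checks that count for the sample text; count_ngrams() is a hypothetical helper introduced only for illustration:

```python
def count_ngrams(num_tokens, n):
    """Number of n-grams in a sequence of num_tokens tokens."""
    # Each n-gram starts at a distinct index, and the last valid start
    # index is num_tokens - n, so there are num_tokens - n + 1 of them.
    return max(0, num_tokens - n + 1)


tokens = "Educative is the best platform".split()

for n in range(1, 7):
    print(n, count_ngrams(len(tokens), n))
# 1 5
# 2 4
# 3 3
# 4 2
# 5 1
# 6 0
```

The counts for n = 1 to 4 match the lengths of the unigram, bigram, trigram, and 4-gram lists shown above.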

import re

def n_gram(text, n=1):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    gram_tokens = [token for token in text.split(" ") if token != ""]
    ngrams = zip(*[gram_tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]

def unigram(text):
    print("Unigram")
    print(n_gram(text, 1))

def bigram(text):
    print("Bigram")
    print(n_gram(text, 2))

def trigram(text):
    print("Trigram")
    print(n_gram(text, 3))

if __name__ == "__main__":
    text = "Educative is the best platform"
    unigram(text)
    bigram(text)
    trigram(text)

Explanation

  • Line 1: We import the re module.
  • Line 3: We define the n_gram() function, which generates the n-grams for the given text and value of n.
  • Line 4: The text is converted to lowercase.
  • Line 5: Every character that is not an ASCII letter, a digit, or whitespace is replaced with a space.
  • Line 6: The tokens are generated by splitting the text on the space character and discarding empty strings.
  • Lines 7–8: The n-grams are generated by zipping n copies of the token list, each starting one token later than the previous one, and are returned as a list of space-joined strings.
  • Lines 10–12: We define the unigram() function. This prints the unigram representation of the text by calling n_gram() with n=1.
  • Lines 14–16: We define the bigram() function. This prints the bigram representation of the text by calling n_gram() with n=2.
  • Lines 18–20: We define the trigram() function. This prints the trigram representation of the text by calling n_gram() with n=3.
  • Line 23: We define the text.
  • Line 24: We invoke the unigram() function.
  • Line 25: We invoke the bigram() function.
  • Line 26: We invoke the trigram() function.
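
The zip() step inside n_gram() can be easier to follow with the intermediate values written out. The sketch below repeats that step for n = 3 on the lowercased sample tokens; the variable names shifted and windows are illustrative and do not appear in the code above:

```python
tokens = ["educative", "is", "the", "best", "platform"]
n = 3

# One copy of the token list per position in the n-gram, each starting
# one token later than the previous copy.
shifted = [tokens[i:] for i in range(n)]
# shifted[0] == ['educative', 'is', 'the', 'best', 'platform']
# shifted[1] == ['is', 'the', 'best', 'platform']
# shifted[2] == ['the', 'best', 'platform']

# zip() pairs up items column by column and stops at the shortest list,
# so only complete windows of n tokens are produced.
windows = list(zip(*shifted))

print([" ".join(w) for w in windows])
# ['educative is the', 'is the best', 'the best platform']
```

Because zip() stops at the shortest input, a text with fewer than n tokens produces an empty list rather than an error.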
