Contiguous sequences of words, symbols, or tokens are known as n-grams. They are groups of adjacent items in a document. They are relevant in natural language processing (NLP) tasks that deal with textual data.
n is a positive integer that can take values such as 1, 2, 3, 4, and so on.
Depending on the value of n, n-grams fall into the following categories:
Unigrams are a type of n-gram where the value of n is 1. Unigram means taking only one word or token at a time.
Example:
Text = “Educative is the best platform”
The unigram for the above text is as follows:
[“Educative”, “is”, “the”, “best”, “platform”]
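As a minimal sketch, assuming that splitting on whitespace is an adequate tokenizer for this sentence (there is no punctuation to handle), the unigrams can be produced with Python's built-in `str.split()`. The variable names are illustrative.

```python
# Unigrams: every whitespace-separated token on its own (n = 1).
text = "Educative is the best platform"

# split() with no argument splits on any run of whitespace.
unigrams = text.split()
print(unigrams)  # ['Educative', 'is', 'the', 'best', 'platform']
```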
Bigrams are a type of n-gram where the value of n is 2. Bigram means taking two words or tokens at a time.
Example:
Text = “Educative is the best platform”
The bigram for the above text is as follows:
[“Educative is”, “is the”, “the best”, “best platform”]
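One possible way to build these pairs, again assuming plain whitespace tokenization, is to zip the token list with a copy of itself shifted by one position. The names `tokens` and `bigrams` are illustrative.

```python
# Bigrams: pair each token with its right-hand neighbour (n = 2).
text = "Educative is the best platform"
tokens = text.split()

# zip() stops at the shorter sequence, so the last token does not start a pair.
bigrams = [f"{left} {right}" for left, right in zip(tokens, tokens[1:])]
print(bigrams)  # ['Educative is', 'is the', 'the best', 'best platform']
```

A text with five tokens yields four bigrams, one fewer than the number of tokens.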
Trigrams are a type of n-gram where the value of n is 3. Trigram means taking three words or tokens at a time.
Example:
Text = “Educative is the best platform”
The trigram for the above text is as follows:
[“Educative is the”, “is the best”, “the best platform”]
n-grams can be defined for any given value of n.
Let us consider n to be 4. This means taking four words or tokens at a time.
Example:
Text = “Educative is the best platform”
The 4-gram for the above text is as follows:
[“Educative is the best”, “is the best platform”]
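The same idea extends to any n. The sketch below uses an index-based sliding window. The helper name `sliding_ngrams` is hypothetical, and it assumes whitespace tokenization. A text with N tokens has N - n + 1 window positions, so it yields that many n-grams, or none when n is larger than N.

```python
def sliding_ngrams(text, n):
    """Return the word n-grams of `text` using an index-based sliding window."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    tokens = text.split()
    # A window of size n fits at len(tokens) - n + 1 start positions;
    # range() is empty when that number is zero or negative.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


text = "Educative is the best platform"
four_grams = sliding_ngrams(text, 4)
print(four_grams)               # ['Educative is the best', 'is the best platform']
print(sliding_ngrams(text, 6))  # [] because the text has only five tokens
```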
import re


def n_gram(text, n=1):
    # Normalize: lowercase, then replace every non-alphanumeric character with a space.
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # split() with no argument splits on any run of whitespace and drops empty tokens.
    gram_tokens = text.split()
    # Zip n shifted copies of the token list to form sliding windows of size n.
    ngrams = zip(*[gram_tokens[i:] for i in range(n)])
    return [" ".join(ngram) for ngram in ngrams]


def unigram(text):
    print("Unigram")
    print(n_gram(text, 1))


def bigram(text):
    print("Bigram")
    print(n_gram(text, 2))


def trigram(text):
    print("Trigram")
    print(n_gram(text, 3))


if __name__ == "__main__":
    text = "Educative is the best platform"
    unigram(text)
    bigram(text)
    trigram(text)
We import the re module.
We define the n_gram() function. This generates the n-grams for the given text and the given value of n.
We define the unigram() function. This generates the unigram representation of the text by invoking the n_gram() function with n=1.
We define the bigram() function. This generates the bigram representation of the text by invoking the n_gram() function with n=2.
We define the trigram() function. This generates the trigram representation of the text by invoking the n_gram() function with n=3.
We invoke the unigram() function.
We invoke the bigram() function.
We invoke the trigram() function.
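The least obvious step in n_gram() is the `zip(*[gram_tokens[i:] for i in range(n)])` expression. The sketch below breaks it into two steps for n=3, using the lowercased tokens that n_gram() would produce for the sample text. The names `shifted` and `windows` are illustrative and do not appear in the code above.

```python
tokens = ["educative", "is", "the", "best", "platform"]
n = 3

# One shifted copy of the token list per position inside the n-gram:
# shifted[0] starts at "educative", shifted[1] at "is", shifted[2] at "the".
shifted = [tokens[i:] for i in range(n)]

# zip(*shifted) reads the shifted lists in parallel and stops at the shortest one,
# so each tuple holds n consecutive tokens.
windows = list(zip(*shifted))
print(windows[0])    # ('educative', 'is', 'the')
print(len(windows))  # 3
```

Because zip() stops at the shortest list, no partial n-gram is produced at the end of the text.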