The n-gram model is a probabilistic language model that predicts the next item in a sequence using an (n − 1)-order Markov model. Let’s understand this better with an example. Consider the following sentence:
“I love reading blogs on Educative to learn new concepts”
A 1-gram (or unigram) is a one-word sequence. For the above sentence, the unigrams would simply be: “I”, “love”, “reading”, “blogs”, “on”, “Educative”, “to”, “learn”, “new”, “concepts”.
A 2-gram (or bigram) is a two-word sequence, such as “I love”, “love reading”, “on Educative”, or “new concepts”.
Lastly, a 3-gram (or trigram) is a three-word sequence, such as “I love reading”, “blogs on Educative”, or “learn new concepts”.
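To make these definitions concrete, here is a minimal Python sketch (our own illustration, not part of the lesson) that lists the unigrams, bigrams, and trigrams of the example sentence; the helper name `ngrams` is an assumption.

```python
# A minimal sketch: extract n-grams from the example sentence by whitespace tokenization.
sentence = "I love reading blogs on Educative to learn new concepts"
tokens = sentence.split()

def ngrams(tokens, n):
    """Return all contiguous n-word sequences from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

print(ngrams(tokens, 1))  # unigrams: ('I',), ('love',), ...
print(ngrams(tokens, 2))  # bigrams: ('I', 'love'), ('love', 'reading'), ...
print(ngrams(tokens, 3))  # trigrams: ('I', 'love', 'reading'), ...
```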
An n-gram language model predicts the probability of a given n-gram within any sequence of words in the language. If we have a good n-gram model, we can estimate p(w | h), the probability of seeing the word w given a history of previous words h, where the history h contains the previous n − 1 words.
Example: “I love reading ___”. Here, we want to predict which word fills the blank, given the words that come before it.
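In symbols, completing the blank amounts to picking the word w with the highest conditional probability given the observed history (a standard formulation, written out here for clarity):

\hat{w} = \arg\max_{w} \; p(w \mid \text{I love reading})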
To construct an n-gram model, we must estimate this probability. We compute it in two steps:
- Apply the chain rule of probability.
- Apply a strong simplifying assumption so that p(w1 … wn) can be computed easily.
The chain rule of probability is:
p(w_1 w_2 \dots w_n) = p(w_1) \cdot p(w_2 \mid w_1) \cdot p(w_3 \mid w_1 w_2) \cdot p(w_4 \mid w_1 w_2 w_3) \cdots p(w_n \mid w_1 \dots w_{n-1})
Definition: The chain rule tells us how to compute the joint probability of a sequence from the conditional probability of each word given the words that precede it.
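For example, applying the chain rule to the four-word prefix “I love reading blogs” (a worked example of our own) gives:

p(\text{I love reading blogs}) = p(\text{I}) \cdot p(\text{love} \mid \text{I}) \cdot p(\text{reading} \mid \text{I love}) \cdot p(\text{blogs} \mid \text{I love reading})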
Here, we do not have access to these conditional probabilities, whose conditioning histories can be up to n − 1 words long. So, how do we proceed? This is where we introduce a simplifying assumption. For every conditional term, we assume that:
p(w_k \mid w_1 \dots w_{k-1}) = p(w_k \mid w_{k-1})
Here, we approximate the history (the context) of the word wk by looking only at the last word of the context. This is called the Markov assumption, and this particular approximation gives the bigram model. The same idea can be extended to longer contexts; for a trigram model, the formula becomes:
p(w_k \mid w_1 \dots w_{k-1}) = p(w_k \mid w_{k-2} w_{k-1})
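The following Python sketch (our own illustration; the toy corpus and function names are assumptions) estimates bigram probabilities by maximum likelihood, count(wk−1 wk) / count(wk−1), and uses them to guess the word that follows “reading”:

```python
from collections import Counter, defaultdict

# Toy corpus, assumed for illustration only.
corpus = [
    "I love reading blogs on Educative to learn new concepts",
    "I love reading books",
    "I love learning new concepts",
]

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate p(curr | prev) = count(prev curr) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("reading", "blogs"))          # 0.5 ("reading" is followed by "blogs" once, "books" once)
print(bigram_counts["reading"].most_common(1))  # [('blogs', 1)] ('blogs' and 'books' are tied here)
```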
These models have a basic problem: they assign zero probability to any word or n-gram that was never seen during training. To handle this, we use smoothing, which reserves some probability mass for unseen events. There are different smoothing techniques, such as Laplace smoothing, Good-Turing smoothing, and Kneser-Ney smoothing.
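For instance, Laplace (add-one) smoothing adds one to every bigram count and adds the vocabulary size V to the denominator, so unseen bigrams receive a small but non-zero probability. A minimal sketch, again using an assumed toy corpus:

```python
from collections import Counter, defaultdict

# Toy corpus, assumed for illustration only.
corpus = ["I love reading blogs", "I love reading books"]

vocab = set()
bigram_counts = defaultdict(Counter)
for line in corpus:
    tokens = line.split()
    vocab.update(tokens)
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1

def laplace_prob(prev, curr, k=1):
    """Add-k smoothed estimate of p(curr | prev); k=1 is Laplace (add-one) smoothing."""
    V = len(vocab)
    return (bigram_counts[prev][curr] + k) / (sum(bigram_counts[prev].values()) + k * V)

print(laplace_prob("reading", "blogs"))  # seen bigram:  (1 + 1) / (2 + 5) ≈ 0.286
print(laplace_prob("reading", "love"))   # unseen bigram: (0 + 1) / (2 + 5) ≈ 0.143, not zero
```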