What is a bigram language model?

Key takeaways:

  • A bigram language model predicts the next word based on the previous word.

  • It simplifies complex dependencies using the Markov assumption.

  • The model uses vocabulary, bigrams, and probability distributions to calculate word likelihoods.

  • It is easy to implement but only captures short-range word dependencies.

  • Large datasets are required to avoid issues with sparse data.

  • Bigram models are used in text generation, speech recognition, translation, and spelling correction.

  • They are the foundation for advanced models like trigrams and neural networks.

Bigram language model

A bigram language model is a statistical language model used in natural language processing (NLP) to predict the likelihood of a word in a sequence based on the preceding word. It is a simple yet powerful approach to modeling language, focusing on pairs of consecutive words (bigrams) to capture local word dependencies.

How does a bigram model work?

The bigram model assumes that the probability of a word depends only on the word immediately preceding it. This is an application of the Markov assumption, which simplifies complex sequential data by reducing the dependency horizon to a fixed size (one in the case of a bigram model).

Mathematically, for a sequence of words w1, w2, …, wn, the probability of the sequence is approximated as:

P(w1, w2, …, wn) ≈ P(w1) × P(w2 | w1) × P(w3 | w2) × … × P(wn | wn−1)

Here, P(wi | wi−1) is the probability of word wi given the preceding word wi−1.

Key components of a bigram model

  1. Vocabulary: A finite set of words to train the model. Words not in the vocabulary are often replaced with a placeholder like <UNK> for “unknown.”

  2. Bigrams: Pairs of consecutive words in a text corpus, such as “natural language” or “language processing.”

  3. Probability distribution: The model calculates the conditional probability of each bigram, P(wi | wi−1), from a given dataset (a count-based estimate is sketched below).
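
In practice, this conditional probability is typically estimated from corpus counts (maximum-likelihood estimation): the count of the bigram divided by the count of its first word, which is exactly what the code later in this answer computes:

P(wi | wi−1) = count(wi−1, wi) / count(wi−1)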

Example

Suppose we have a simple sentence:
“I love programming languages.”

The bigrams extracted from this sentence are:

  • ("I", "love")

  • ("love", "programming")

  • ("programming", "languages")

Using a bigram model, the probability of the sentence is computed as:

P(I, love, programming, languages) ≈ P(I) × P(love | I) × P(programming | love) × P(languages | programming)

Code: Bigram language model implementation

Here’s an example of a bigram language model in Python. This example processes a custom dataset, builds the bigram model, calculates probabilities, and generates text using the model.

import random
from collections import defaultdict

# Step 1: Dataset preparation
def prepare_data():
    sentences = [
        "The cat sat on the mat",
        "The dog barked at the cat",
        "The bird sang a song",
        "The cat chased the mouse",
        "The dog and the bird played together"
    ]
    tokenized_sentences = [sentence.split() for sentence in sentences]
    return tokenized_sentences

# Step 2: Build the bigram model
def build_bigram_model(sentences):
    bigram_counts = defaultdict(int)
    unigram_counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - 1):
            bigram = (sentence[i], sentence[i + 1])
            bigram_counts[bigram] += 1
            unigram_counts[sentence[i]] += 1
        unigram_counts[sentence[-1]] += 1  # Count the last word in the sentence
    bigram_probabilities = {}
    for bigram, count in bigram_counts.items():
        bigram_probabilities[bigram] = count / unigram_counts[bigram[0]]
    return bigram_probabilities

# Step 3: Generate text using the bigram model
def generate_text(bigram_probabilities, start_word, num_words=10):
    current_word = start_word
    generated_text = [current_word]
    for _ in range(num_words - 1):
        # Filter bigrams that start with the current word
        candidates = {bigram: prob for bigram, prob in bigram_probabilities.items() if bigram[0] == current_word}
        if not candidates:
            break  # Stop if no valid bigram is found
        # Choose the next word based on probabilities
        next_word = random.choices(list(candidates.keys()), weights=list(candidates.values()))[0][1]
        generated_text.append(next_word)
        current_word = next_word
    return " ".join(generated_text)

# Main execution
if __name__ == "__main__":
    sentences = prepare_data()
    bigram_probabilities = build_bigram_model(sentences)
    print("Bigram Probabilities:")
    for bigram, prob in bigram_probabilities.items():
        print(f"{bigram}: {prob:.2f}")
    print("\nGenerated Text:")
    start_word = "The"
    generated_text = generate_text(bigram_probabilities, start_word)
    print(generated_text)

Explanation

  • Data preparation: The function prepare_data splits a list of sentences into tokenized word lists.

  • Bigram model: The function build_bigram_model computes the probabilities of each bigram by dividing the frequency of a bigram by the frequency of its first word (unigram).

  • Text generation: The function generate_text starts with a given word and iteratively selects the next word based on bigram probabilities until the desired number of words is generated or no matching bigram remains.

The bigram model predicts the next word by considering the probabilities of bigrams that start with the current word. Higher probabilities make a word more likely to follow, as the output demonstrates. For instance:

  • "The" is likely followed by "cat" or "dog" (0.40 each).

  • "cat" is equally likely followed by "sat" or "chased" (0.33 each).

Advantages of bigram models

  • Simplicity: Easy to implement and computationally efficient.

  • Local context: Captures short-range dependencies, which is useful for many applications, such as spell-checking and text generation (see the scoring sketch after this list).

  • Foundation for larger models: Bigram models are the basis for more complex models, such as trigrams or neural network-based models.
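
As a rough illustration of the spell-checking use case, a bigram model can rank alternative word sequences by how probable their word pairs are. The sketch below is one possible wiring rather than a standard API: sentence_logprob is an illustrative helper, and it assumes prepare_data and build_bigram_model from the listing above are in scope.

import math

def sentence_logprob(words, bigram_probabilities, floor=1e-8):
    # Sum the log-probabilities of consecutive word pairs; unseen bigrams get a
    # small floor value as a crude stand-in for proper smoothing.
    return sum(math.log(bigram_probabilities.get(pair, floor))
               for pair in zip(words, words[1:]))

# Rank candidates: the sequence whose bigrams the model has actually seen scores higher.
model = build_bigram_model(prepare_data())
candidates = ["The cat sat on the mat".split(),
              "The cat sat the on mat".split()]
best = max(candidates, key=lambda words: sentence_logprob(words, model))
print(" ".join(best))  # the fluent ordering wins because all of its bigrams were observed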

Limitations of bigram models

  • Limited context: Only considers the immediately preceding word, which can lead to inaccurate predictions for longer dependencies.

  • Data sparsity: Requires a large dataset to estimate probabilities accurately, as many bigrams may not occur frequently enough in small corpora (smoothing, sketched after this list, is a common workaround).

  • Memory usage: Storing probabilities for all possible bigrams can become memory-intensive for large vocabularies.
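
A common remedy for sparsity is add-k (Laplace) smoothing, which reserves a small, nonzero probability for every possible bigram. Below is a minimal sketch under that assumption; build_smoothed_bigram_model is an illustrative name rather than a standard API, and it assumes prepare_data from the listing above is in scope.

from collections import defaultdict

def build_smoothed_bigram_model(sentences, k=1.0):
    bigram_counts = defaultdict(int)
    unigram_counts = defaultdict(int)
    vocab = set()
    for sentence in sentences:
        vocab.update(sentence)
        for i in range(len(sentence) - 1):
            bigram_counts[(sentence[i], sentence[i + 1])] += 1
            unigram_counts[sentence[i]] += 1
    vocab_size = len(vocab)

    def probability(prev_word, word):
        # Add-k smoothing: unseen bigrams get a small but nonzero probability.
        return (bigram_counts[(prev_word, word)] + k) / (unigram_counts[prev_word] + k * vocab_size)

    return probability

smoothed = build_smoothed_bigram_model(prepare_data())
print(smoothed("The", "cat"))   # seen bigram: relatively high probability
print(smoothed("The", "song"))  # unseen bigram: small but nonzero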

Applications of bigram models

  • Text generation: Producing simple text word by word, as in the code example above.

  • Speech recognition: Ranking candidate transcriptions by how likely their word sequences are.

  • Machine translation: Scoring the fluency of candidate translations.

  • Spelling correction: Preferring corrections that form common word pairs.

Improvements over bigram models

While bigram models are a good starting point, more advanced models like trigram models (considering two previous words) or n-gram models (generalizing to n-word contexts) provide greater accuracy. Today, neural network-based models such as transformers (e.g., GPT, BERT) have largely surpassed bigram models in performance.
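
As a concrete illustration, the counting recipe from the listing above extends naturally from pairs to triples. The sketch below assumes prepare_data is in scope; build_trigram_model is an illustrative name:

from collections import defaultdict

def build_trigram_model(sentences):
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)  # counts of the two-word history
    for sentence in sentences:
        for i in range(len(sentence) - 2):
            trigram_counts[(sentence[i], sentence[i + 1], sentence[i + 2])] += 1
            context_counts[(sentence[i], sentence[i + 1])] += 1
    # P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2)
    return {trigram: count / context_counts[trigram[:2]]
            for trigram, count in trigram_counts.items()}

trigram_probabilities = build_trigram_model(prepare_data())
print(trigram_probabilities[("The", "cat", "sat")])  # 0.5: "The cat" continues with "sat" in 1 of 2 sentences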

Conclusion

A bigram language model is a foundational concept in NLP, demonstrating how local word dependencies can be used to model language. Although it has limitations, it provides a simple and intuitive framework for understanding more advanced language modeling techniques. Whether in text prediction, generation, or analysis, bigram models have laid the groundwork for modern advancements in natural language processing.

Curious about creating custom language models from the ground up? Dive into the Create Your Own Language Models from Scratch project.

Frequently asked questions



What is the difference between n-gram and bigram?

An n-gram is a sequence of ‘n’ consecutive words, while a bigram is specifically an n-gram with ‘n’ equal to 2, focusing only on word pairs.
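
For example, NLTK's ngrams helper generalizes the bigrams helper shown later in this FAQ; it takes the token list and the value of n:

from nltk import ngrams
list(ngrams(['this', 'is', 'a', 'test'], 3))  # [('this', 'is', 'a'), ('is', 'a', 'test')]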


What is the difference between bigram and trigram model?

A bigram model considers pairs of consecutive words, while a trigram model uses sequences of three consecutive words, capturing more context.


What is the most common bigram in the English language?

The most common bigram in English is often “of the,” based on frequency in text corpora.


How do you use bigrams in NLTK?

Use nltk.bigrams() to generate bigrams from a list of words. For example:

from nltk import bigrams
list(bigrams(['this', 'is', 'a', 'test']))  # [('this', 'is'), ('is', 'a'), ('a', 'test')]

What is the difference between unigram and bigram models?

The unigram model treats each word as an independent unit, predicting word probabilities without considering context. In contrast, the bigram model looks at pairs of consecutive words, predicting the probability of a word based on the previous word, capturing some contextual information.

