An n-gram is a sequence of ‘n’ consecutive words, while a bigram is specifically an n-gram with ‘n’ equal to 2, focusing only on word pairs.
Key takeaways:
A bigram language model predicts the next word based on the previous word.
It simplifies complex dependencies using the Markov assumption.
The model uses vocabulary, bigrams, and probability distributions to calculate word likelihoods.
It is easy to implement but only captures short-range word dependencies.
Large datasets are required to avoid issues with sparse data.
Bigram models are used in text generation, speech recognition, translation, and spelling correction.
They are the foundation for advanced models like trigrams and neural networks.
A bigram language model is a statistical language model used in natural language processing (NLP) to predict the likelihood of a word in a sequence based on the preceding word. It is a simple yet powerful approach to modeling language, focusing on pairs of consecutive words (bigrams) to capture local word dependencies.
The bigram model assumes that the probability of a word depends only on the word immediately preceding it. This is an application of the Markov assumption, which simplifies complex sequential data by reducing the dependency horizon to a fixed size (one in the case of a bigram model).
Mathematically, for a sequence of words w1, w2, …, wn, the probability of the sequence is approximated as:
P(w1, w2, …, wn) ≈ P(w1) × P(w2 | w1) × P(w3 | w2) × … × P(wn | wn−1)
Here, P(wi | wi−1) is the probability of word wi given the preceding word wi−1.
Vocabulary: A finite set of words to train the model. Words not in the vocabulary are often replaced with a placeholder like <UNK> for “unknown.”
Bigrams: Pairs of consecutive words in a text corpus, such as “natural language” or “language processing.”
Probability distribution: The model calculates the conditional probability of each bigram P(wi | wi−1) from a given dataset.
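In the simplest, unsmoothed form of the model, each of these conditional probabilities is estimated directly from corpus counts:
P(wi | wi−1) = count(wi−1, wi) / count(wi−1)
For instance, if the pair “the cat” occurs 2 times in a corpus and “the” occurs 5 times, the estimate is 2/5 = 0.40.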
Suppose we have a simple sentence:
“I love programming languages.”
The bigrams extracted from this sentence are:
("I", "love")
("love", "programming")
("programming", "languages")
Using a bigram model, the probability of the sentence is computed as:
P(“I love programming languages”) ≈ P(“I”) × P(“love” | “I”) × P(“programming” | “love”) × P(“languages” | “programming”)
Here’s an example of a bigram language model in Python. This example processes a custom dataset, builds the bigram model, calculates probabilities, and generates text using the model.
import random
from collections import defaultdict

# Step 1: Dataset preparation
def prepare_data():
    sentences = [
        "The cat sat on the mat",
        "The dog barked at the cat",
        "The bird sang a song",
        "The cat chased the mouse",
        "The dog and the bird played together"
    ]
    tokenized_sentences = [sentence.split() for sentence in sentences]
    return tokenized_sentences

# Step 2: Build the bigram model
def build_bigram_model(sentences):
    bigram_counts = defaultdict(int)
    unigram_counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - 1):
            bigram = (sentence[i], sentence[i + 1])
            bigram_counts[bigram] += 1
            unigram_counts[sentence[i]] += 1
        unigram_counts[sentence[-1]] += 1  # Count the last word in the sentence
    bigram_probabilities = {}
    for bigram, count in bigram_counts.items():
        bigram_probabilities[bigram] = count / unigram_counts[bigram[0]]
    return bigram_probabilities

# Step 3: Generate text using the bigram model
def generate_text(bigram_probabilities, start_word, num_words=10):
    current_word = start_word
    generated_text = [current_word]
    for _ in range(num_words - 1):
        # Filter bigrams that start with the current word
        candidates = {bigram: prob for bigram, prob in bigram_probabilities.items()
                      if bigram[0] == current_word}
        if not candidates:
            break  # Stop if no valid bigram is found
        # Choose the next word based on probabilities
        next_word = random.choices(list(candidates.keys()),
                                   weights=list(candidates.values()))[0][1]
        generated_text.append(next_word)
        current_word = next_word
    return " ".join(generated_text)

# Main execution
if __name__ == "__main__":
    sentences = prepare_data()
    bigram_probabilities = build_bigram_model(sentences)

    print("Bigram Probabilities:")
    for bigram, prob in bigram_probabilities.items():
        print(f"{bigram}: {prob:.2f}")

    print("\nGenerated Text:")
    start_word = "The"
    generated_text = generate_text(bigram_probabilities, start_word)
    print(generated_text)
Data preparation: The function prepare_data splits a list of sentences into tokenized word lists.
Bigram model: The function build_bigram_model computes the probability of each bigram by dividing the frequency of the bigram by the frequency of its first word (unigram).
Text generation: The function generate_text starts with a given word and iteratively selects the next word based on bigram probabilities until the desired number of words is generated.
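For reference, with the five-sentence corpus above, the probability table printed by the script would begin with lines like the following (the generated text at the end differs from run to run because the next word is sampled randomly):

('The', 'cat'): 0.40
('cat', 'sat'): 0.33
('sat', 'on'): 1.00
('on', 'the'): 1.00
('the', 'mat'): 0.25
('The', 'dog'): 0.40
...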
The bigram model predicts the next word by considering the probabilities of bigrams that start with the current word. Higher probabilities make a word more likely to follow, as the output demonstrates. For instance:
"The" is likely followed by "cat" or "dog" (0.40 each).
"cat" is equally likely followed by "sat" or "chased" (0.33 each).
Simplicity: Easy to implement and computationally efficient.
Local context: Captures short-range dependencies, making it useful for many applications, such as spell-checking and text generation.
Foundation for larger models: Bigram models are the basis for more complex models, such as trigrams or neural network-based models.
Limited context: Only considers the immediately preceding word, which can lead to inaccurate predictions for longer dependencies.
Data sparsity: Requires a large dataset to estimate probabilities accurately, as many bigrams may not occur frequently enough in small corpora; smoothing techniques (see the sketch after this list) are commonly used to mitigate this.
Memory usage: Storing probabilities for all possible bigrams can become memory-intensive for large vocabularies.
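One standard way to soften the data-sparsity issue, not used in the example above, is add-one (Laplace) smoothing, which pretends every possible bigram has been seen at least once so that unseen pairs keep a small, non-zero probability. A minimal sketch, assuming the bigram_counts and unigram_counts dictionaries built inside build_bigram_model are available (the function name is illustrative):

def laplace_bigram_probability(bigram, bigram_counts, unigram_counts):
    # Add-one (Laplace) smoothing: unseen bigrams get a small, non-zero probability
    vocabulary_size = len(unigram_counts)
    previous_word_count = unigram_counts.get(bigram[0], 0)
    return (bigram_counts.get(bigram, 0) + 1) / (previous_word_count + vocabulary_size)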
Text generation: Used to generate text by predicting the next word based on the previous word.
Speech recognition: Helps improve word prediction in speech-to-text systems.
Machine translation: Assists in predicting word sequences in target languages.
Spelling correction: Identifies contextually correct word replacements.
While bigram models are a good starting point, more advanced models like trigram models (considering two previous words) or n-gram models (generalizing to n-word contexts) provide greater accuracy. Today, neural network-based models such as transformers (e.g., GPT, BERT) have largely surpassed bigram models in performance.
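To illustrate how the same counting idea extends beyond bigrams, here is a rough sketch of the trigram counting step (the function name is illustrative); normalization and text generation then follow the same pattern as the bigram code above:

from collections import defaultdict

def build_trigram_counts(sentences):
    # Condition each word on the two words that precede it
    trigram_counts = defaultdict(int)
    context_counts = defaultdict(int)
    for sentence in sentences:
        for i in range(len(sentence) - 2):
            context = (sentence[i], sentence[i + 1])
            trigram_counts[(context, sentence[i + 2])] += 1
            context_counts[context] += 1
    # P(w3 | w1, w2) = trigram_counts[((w1, w2), w3)] / context_counts[(w1, w2)]
    return trigram_counts, context_counts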
A bigram language model is a foundational concept in NLP, demonstrating how local word dependencies can be used to model language. Although it has limitations, it provides a simple and intuitive framework for understanding more advanced language modeling techniques. Whether in text prediction, generation, or analysis, bigram models have laid the groundwork for modern advancements in natural language processing.
Curious about creating custom language models from the ground up? Dive into the Create Your Own Language Models from Scratch project.