Large Language Models
Get introduced to language models and large language models.
Overview
Let’s imagine a conversation with a friend, where the friend starts a sentence with “I’m going to make a cup of ________.” Humans would likely predict that the next word could be “coffee” or “tea” based on their knowledge of common beverage choices.
Similarly, a language model is trained to understand and predict the next word in a sequence based on the context of the preceding words. It learns from vast amounts of text data and can make informed predictions about what word will likely come next in a given context.
Before going into more detail, let’s first discuss what language models are.
Language models
A language model (LM) can be defined as a probabilistic model that assigns probabilities to sequences of words or tokens in a given language. The goal is to capture the structure and patterns of the language to predict the likelihood of a particular sequence of words.
Let’s assume we have a vocabulary $V$ and a sequence of $n$ words $w_1, w_2, \ldots, w_n$, where each word $w_i$ belongs to $V$.
The probability of the entire sequence can be expressed as follows:

$$P(w_1, w_2, \ldots, w_n)$$
Example
Assume we have the sequence “I love language models.” The LM assigns it a probability:

$$P(\text{I}, \text{love}, \text{language}, \text{models})$$

A well-trained model should assign this sequence a higher probability than an ungrammatical reordering such as “models language love I.”
Note: LMs must be exposed to real language data before they can assign meaningful probabilities; therefore, they are trained. During this training process, the model learns to assign higher probabilities to words that are more likely to follow a given context. After training, the LM can generate text by sampling words based on these learned probabilities.
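To make the sampling step concrete, here is a minimal sketch (not taken from this lesson’s code) that draws the next word from a hypothetical learned distribution; the candidate words and their probabilities are invented for illustration.

import random

# Hypothetical learned probabilities for the word that follows "a cup of"
next_word_probs = {"coffee": 0.45, "tea": 0.40, "soup": 0.10, "pencils": 0.05}

# Sample the next word in proportion to its learned probability
words = list(next_word_probs.keys())
weights = list(next_word_probs.values())
next_word = random.choices(words, weights=weights, k=1)[0]

print("I'm going to make a cup of", next_word)

Words like “coffee” are sampled far more often than “pencils,” which is exactly the behavior training is meant to produce.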
Prediction
We can also predict a word given a sequence. An LM estimates this probability by considering the conditional probabilities of each word given the previous words in the sequence. Using the chain rule of probability, the joint probability of the sequence can be decomposed as:

$$P(w_1, w_2, \ldots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, w_2, \ldots, w_{i-1})$$
For example:

$$P(\text{I}, \text{love}, \text{language}, \text{models}) = P(\text{I}) \cdot P(\text{love} \mid \text{I}) \cdot P(\text{language} \mid \text{I}, \text{love}) \cdot P(\text{models} \mid \text{I}, \text{love}, \text{language})$$
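The following minimal sketch works through this decomposition with hypothetical conditional probabilities; the numbers are invented purely for illustration, not learned from data.

# Hypothetical conditional probabilities for the sequence "I love language models"
p_i = 0.20                        # P(I)
p_love_given_i = 0.05             # P(love | I)
p_language_given_i_love = 0.10    # P(language | I, love)
p_models_given_prefix = 0.30      # P(models | I, love, language)

# Chain rule: the joint probability is the product of the conditionals
p_sequence = p_i * p_love_given_i * p_language_given_i_love * p_models_given_prefix
print(f"P(I love language models) = {p_sequence:.6f}")  # 0.000300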
In practice, accurately modeling these conditional probabilities is a complex task. Modern LMs, such as transformer-based models like GPT-3, utilize deep learning techniques to capture intricate patterns and dependencies in the data.
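As a rough illustration, a small pretrained transformer such as GPT-2 (used here as a freely available stand-in for larger models like GPT-3) exposes exactly these conditional probabilities. The sketch below assumes the transformers and torch packages are installed; the prompt and the top-5 cutoff are arbitrary choices.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Score every possible next token after the prompt
prompt = "I'm going to make a cup of"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The last position holds the model's distribution over the next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode([int(token_id)])!r}: {prob.item():.3f}")

Running this typically ranks continuations such as “ coffee” and “ tea” near the top, mirroring the intuition from the opening example.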
N-gram language model
N-gram models are a type of probabilistic LM used in natural language processing and computational linguistics. These models are based on the idea that the probability of a word depends only on the previous $n-1$ words in the sequence.
For example, consider the following sentence: I love language models.
Unigram (1-gram): “I,” “love,” “language,” “models”
Bigram (2-gram): “I love,” “love language,” “language models”
Trigram (3-gram): “I love language,” “love language models”
4-gram: “I love language models”
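The short sketch below (separate from the fuller example later in this lesson) extracts these n-grams programmatically:

def extract_ngrams(text, n):
    """Return the list of n-grams (as tuples) from a whitespace-tokenized text."""
    tokens = text.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "I love language models"
for n in range(1, 5):
    print(f"{n}-grams:", extract_ngrams(sentence, n))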
N-gram models are simple and computationally efficient, making them suitable for various natural language processing tasks. However, their limitations include the inability to capture long-range dependencies in language and the sparsity problem when dealing with higher-order n-grams.
Note: N-gram models have largely been superseded by more advanced LMs, such as recurrent neural networks (RNNs), which have in turn been overtaken by transformer-based LLMs.
The n-gram LM algorithm is as follows:
Tokenization: Split the input text into individual words or tokens.
N-gram generation: Create n-grams by forming sequences of $n$ consecutive words from the tokenized text.
Frequency counting: Count the occurrences of each n-gram in the training corpus.
Probability estimation: Calculate the conditional probability of each word given its previous $n-1$ words using the frequency counts.
Smoothing (optional): Apply smoothing techniques to handle unseen n-grams and avoid zero probabilities.
Text generation: Start with an initial seed of $n-1$ words, predict the next word based on probabilities, and iteratively generate the next words to form a sequence.
Repeat generation: Continue generating words until the desired length or a stopping condition is reached.
Let’s see an example in action:
import random

class NGramLanguageModel:
    def __init__(self, n):
        self.n = n
        self.ngrams = {}
        self.start_tokens = ['<start>'] * (n - 1)

    def train(self, corpus):
        for sentence in corpus:
            tokens = self.start_tokens + sentence.split() + ['<end>']
            for i in range(len(tokens) - self.n + 1):
                ngram = tuple(tokens[i:i + self.n])
                if ngram in self.ngrams:
                    self.ngrams[ngram] += 1
                else:
                    self.ngrams[ngram] = 1

    def generate_text(self, seed_text, length=10):
        seed_tokens = seed_text.split()
        padded_seed_text = self.start_tokens[-(self.n - 1 - len(seed_tokens)):] + seed_tokens
        generated_text = list(padded_seed_text)
        current_ngram = tuple(generated_text[-self.n + 1:])

        for _ in range(length):
            next_words = [ngram[-1] for ngram in self.ngrams.keys() if ngram[:-1] == current_ngram]
            if next_words:
                next_word = random.choice(next_words)
                generated_text.append(next_word)
                current_ngram = tuple(generated_text[-self.n + 1:])
            else:
                break

        return ' '.join(generated_text[len(self.start_tokens):])

# Toy corpus
toy_corpus = [
    "This is a simple example.",
    "The example demonstrates an n-gram language model.",
    "N-grams are used in natural language processing.",
    "This is a toy corpus for language modeling."
]

n = 3  # Change n-gram order here

# Example usage with seed text
model = NGramLanguageModel(n)
model.train(toy_corpus)

seed_text = "This"  # Change seed text here
generated_text = model.generate_text(seed_text, length=3)
print("Seed text:", seed_text)
print("Generated text:", generated_text)
Explanation
Line 1: We import the random module to facilitate random choices during text generation.
Line 3: We define a class named NGramLanguageModel to encapsulate the functionality of the n-gram LM.
Lines 4–7: We define the constructor method for the class, which sets the class attributes: the n-gram order n, the empty dictionary ngrams used to store n-gram frequencies, and the list of start tokens start_tokens used for padding the beginning of sentences. The start_tokens attribute provides context at the beginning of sentences, where there aren’t enough preceding words to form a complete n-gram, ensuring coherent and consistent text generation.
Lines 9–17: We define a method named train to train the LM on a given corpus. We iterate through each sentence in the corpus and tokenize it by adding start tokens, splitting it into individual words, and appending an end token. We then slide over the token sequence to form n-grams of length n, extracting each n-gram as a tuple and updating its frequency count in the ngrams dictionary.
Lines 19–34: We define a method named generate_text to generate text based on the trained LM, starting with a seed text.
Lines 37–53: We define a toy corpus for training and testing the LM. We create an instance of the NGramLanguageModel class with n-gram order n=3 and train it on the corpus. Next, we specify a seed text, generate text based on the trained model, and print both the seed text and the generated text.
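Note that the example above only counts n-grams and picks uniformly among the observed continuations; it skips the probability estimation and smoothing steps listed in the algorithm. The following sketch shows what those steps could look like with add-one (Laplace) smoothing; the function name and the toy counts are assumptions made for illustration, not part of the original model.

from collections import Counter

# Toy bigram and unigram counts, invented for illustration
bigram_counts = Counter({("language", "models"): 3, ("language", "processing"): 1})
unigram_counts = Counter({"language": 4, "models": 3, "processing": 1})
vocab_size = len(unigram_counts)

def smoothed_bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev_word, word)] + 1) / (unigram_counts[prev_word] + vocab_size)

print(smoothed_bigram_prob("language", "models"))  # (3 + 1) / (4 + 3) ≈ 0.571
print(smoothed_bigram_prob("language", "python"))  # unseen bigram still gets (0 + 1) / 7 ≈ 0.143

Smoothing ensures that sequences containing unseen n-grams receive a small but nonzero probability instead of zero.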
Large language models
Large language models (LLMs) refer to advanced natural language processing models trained on massive amounts of textual data. These models are designed to understand and generate human-like text based on the input they receive.
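As a rough sketch of this behavior, the example below uses the Hugging Face text-generation pipeline with GPT-2 as a small, freely available stand-in for a true LLM; it assumes the transformers package and a backend such as torch are installed, and the prompt and generation length are arbitrary.

from transformers import pipeline

# GPT-2 stands in for a much larger model; the interface is the same for bigger open models
generator = pipeline("text-generation", model="gpt2")

result = generator("I'm going to make a cup of", max_new_tokens=20, num_return_sequences=1)
print(result[0]["generated_text"])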
Comparison with simpler LMs
LLMs and simpler LMs differ primarily in scale, complexity, and the tasks they are designed to perform. Here’s a comparison between LLMs and simpler models:
LLMs vs. LMs
| Aspect | LLMs | LMs |
| --- | --- | --- |
| Scale and Parameters | Tens to hundreds of billions of parameters | Millions of parameters |
| Training Data | Trained on vast and diverse datasets from the internet | Can be trained on smaller, domain-specific datasets |
| Versatility | Highly versatile, excelling across various natural language processing tasks | Task-specific, might require more fine-tuning |
| Computational Resources | Demands significant computational power and specialized hardware | More computationally efficient, accessible on standard hardware |
| Use Cases | Complex language understanding, translation, summarization, creative writing | Specific tasks like sentiment analysis and named entity recognition |
Now, let’s take a quiz to revisit the concepts taught in this lesson.
Quiz
Read the question statement, and then select the correct answer from the given choices.
What is an LM?
A set of grammar rules and guidelines used for teaching a language
A probabilistic model that assigns probabilities to sequences of words or tokens in a given language
Software that translates text from one language to another
A database of definitions and synonyms for words in a specific language