
The evolution of GPTs

Nov 15, 2023
14 min read
Contents
Word meanings
Language models
Training language models
Text generation
But wait a minute!
Word embeddings
Buy one, get one free!
The probability model
Example
Generation using RNNs
Sampling the next word
Transformers
Self-attention
How self-attention works
Positional encoding
Training a transformer
Generative Pre-trained Transformer (GPT)
The OpenAI GPT series
GPT-1
GPT-2
GPT-3
Beyond GPT-3
Conclusion

This blog traces the historical developments in computational linguistics that helped give birth to ChatGPT. Like us humans, do computers also need to know what words mean?

Word meanings

Fluent speakers possess vast knowledge, primarily reflected in their vocabulary. This knowledge includes the grammatical function, meaning, real-world reference, and pragmatic use of words. While estimates of adult vocabulary size vary, it is agreed that most words used by mature speakers are acquired early in life through spoken interactions. This early, active vocabulary remains limited compared to the full adult vocabulary, leaving many words to be acquired through other means. Children achieve remarkable vocabulary growth by learning approximately 7 to 10 words daily, with reading playing a significant role in this process.

A renowned principle within linguistics, known as the distributional hypothesis, suggests that word meanings can be learned from text alone, based on the associations between words and the words they co-occur with. This works because synonymous words tend to appear in similar contexts, or alongside similar words, in written text. A class of machine learning models called language models has proven effective at capturing this type of knowledge from vast amounts of text. These language models can be used for various natural language processing tasks, including sentiment analysis, language translation, text summarization, chatbot Q&A, text generation, and more.

This blog will center around the basics of language models that use a specific architecture and provide a foundation for understanding the much-hyped GPT (Generative Pre-trained Transformer). Subsequently, we will explore the evolution of GPT models as it happened over the years.

Language models

In the context of natural language processing (NLP), a language model is a computational model designed to understand and generate human language. It is trained on a large corpus of text data and learns the language’s statistical patterns, relationships, and structures. The main goal of a language model is to predict the probability of the next word in a sequence of words, given the previous words. In particular, if $w_1, w_2, \dots, w_n$ is a sequence of $n$ words, then the probability of the next word, $w_{n+1}$, can be written using the definition of conditional probability as follows:

$$P(w_{n+1} \mid w_n, w_{n-1}, \dots, w_1) = \frac{P(w_{n+1}, w_n, w_{n-1}, \dots, w_1)}{P(w_n, w_{n-1}, \dots, w_1)}$$
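
To make this concrete, here is a minimal sketch (not from the original post) of estimating such probabilities by counting. For simplicity, it conditions on only the single previous word, i.e., a bigram model; the tiny corpus is purely illustrative.

from collections import Counter, defaultdict

# A toy corpus; real language models are trained on billions of words
corpus = "I love ice cream and I love tea".split()

# Count how often each word follows each previous word
bigram_counts = defaultdict(Counter)
for prev_word, next_word in zip(corpus, corpus[1:]):
    bigram_counts[prev_word][next_word] += 1

def next_word_probabilities(prev_word):
    # Normalize the counts into a probability distribution
    counts = bigram_counts[prev_word]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

print(next_word_probabilities("love"))  # {'ice': 0.5, 'tea': 0.5}
print(next_word_probabilities("I"))     # {'love': 1.0}

Real language models condition on much longer histories and learn from far more data, but the underlying idea of estimating a conditional distribution over the next word is the same.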

Training language models

Once such a language model is available, generating the next word for a given sequence of words becomes a matter of sampling from this probability distribution.

However, the initial challenge lies in estimating the probability model itself. Constructing such a language model necessitates an adequate amount of training data, which, fortunately, is readily available nowadays.

A simple way to turn raw text into training examples is to slice it into n-grams. Essentially, n-grams are chunks of consecutive words, where ‘n’ represents the number of words in each chunk. Imagine you have a sentence like “I love ice cream.” If we break this into 2-grams (also known as bigrams), we get the following pairs of words: “I love,” “love ice,” and “ice cream.” For 3-grams (trigrams), we have groups of three words: “I love ice” and “love ice cream.” In each sample, the first n−1 words act as the context and the last word is the prediction target. The code snippet below demonstrates this n-gram slicing.

# Creating a function to generate n-grams
def make_training_samples(text, context_length):
    words = text.split()
    output = []
    # Slide a window of size context_length over the text
    for i in range(len(words) - context_length + 1):
        output.append(words[i:i + context_length])
    return output

# Calling the function
text = 'How old are you right now ?'

# Generate training samples for different context lengths
for context_length in range(2, len(text.split()) + 1):
    print(f'Context Length = {context_length - 1}')
    # Generate training samples using the make_training_samples function
    samples = make_training_samples(text, context_length=context_length)
    # Print each training sample
    for sample in samples:
        print(sample)

After preparing the training data from a substantial corpus of text, the next step is to estimate the probability model, which is referred to as language model training. This process involves training the language model on the prepared dataset, allowing it to learn the statistical patterns and relationships in the language.

Text generation

Once the language model is trained in this way on a text corpus, it gains the ability to sample from the learned probability distribution. This sampling process enables the language model to generate coherent and contextually relevant text, as it can draw upon its learned knowledge of language patterns and structures.

But wait a minute!

Before delving further into the details of how language models generate text, some questions need to be asked:

  • How is a word represented in a language model?
  • How is the estimation of probabilities conducted using prepared training data?
  • Which models can handle input sequences of arbitrary length?
  • How is the generation of the next word accomplished, and what does the sampling process entail?

Let’s address these questions individually and provide a comprehensive explanation for each one.

Word embeddings

In computational models, words are typically represented using numerical vectors or embeddings, instead of strings or characters. These embeddings capture the semantic and syntactic properties of words, allowing them to be processed mathematically.

Learning the optimal word embedding representation can be treated as a standalone task, but more commonly, it is learned concurrently with the estimation of the probability model for the problem being solved.

Buy one, get one free!

Let’s consider a word $w_i$ that is represented by a vector of $d$ numbers:

$$w_i = (v_1, v_2, \dots, v_d)$$

In this representation, $v_1, v_2, \dots, v_d$ are additional learnable parameters that are updated alongside the parameters of the probability model. These embedding parameters are adjusted to enhance the accuracy of predictions. Typically, the parameters are initialized randomly unless a more sophisticated initialization method is employed.
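
As a rough illustration of this “buy one, get one free” idea, the sketch below (with a hypothetical vocabulary and embedding size) represents the embedding table as a plain matrix of randomly initialized numbers, one row per word. In a real model, these rows would be updated by the same optimizer that trains the rest of the probability model.

import numpy as np

# Hypothetical vocabulary: each word maps to an integer ID
vocabulary = {"yes": 0, "no": 1, "maybe": 2}
d = 5  # embedding dimension (illustrative)

# One d-dimensional row of learnable numbers per word, initialized randomly
embedding_table = np.random.rand(len(vocabulary), d)

def embed(word):
    # Look up the embedding vector for a word by its ID
    return embedding_table[vocabulary[word]]

print(embed("maybe"))  # a vector of 5 learnable numbers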

The probability model

The probability model used for predicting the next word needs to possess two essential capabilities:

  • It should take into account all the words in the sequence that have been processed thus far as input.
  • It should generate the probability of each potential next word as output.

While various models can be employed in theory, the most common approach is to use recurrent functions. These functions process one word at a time, taking as input a summary of all previously processed words (the history). At each step, they return an updated history along with a probability distribution for the next word.

The initial history for the first word can be arbitrarily assigned or set randomly.

Example

Let’s look at the following code, which builds a simple language model that predicts the next word in a sequence based on the history of words. It employs a small vocabulary of words, “yes,” “no,” and “maybe,” each represented by numerical embeddings. The model uses two matrices, $W_h$ and $W_x$, as learnable parameters; these assign weights to the historical context and the current word when making predictions. The probability_model function calculates the probabilities of the next word being “yes”, “no”, or “maybe” based on the input history and the current word. The code then demonstrates the model’s usage by iteratively updating the history as it predicts the next word in a given sequence, simulating a simple next-word prediction task.

import numpy as np

# Vocabulary
vocabulary = {"yes": 1, "no": 2, "maybe": 3}

# Embeddings
embeddings = {
    "yes": np.array([0.1, 0.2, 0.3, 0.4, 0.5]),
    "no": np.array([0.6, 0.7, 0.8, 0.9, 1.0]),
    "maybe": np.array([1.1, 1.2, 1.3, 1.4, 1.5]),
}

# Learnable parameters
Wh = np.random.rand(5, 5)
Wx = np.random.rand(5, 5)

# Probability model function
def probability_model(history, current_word):
    # Retrieve the embeddings for the history and the current word
    h_embedding = history
    x_embedding = embeddings[current_word]
    # Compute the weighted sum of embeddings
    weighted_sum = np.dot(Wh, h_embedding) + np.dot(Wx, x_embedding)
    # The updated history carries information about all words seen so far
    history_next = weighted_sum.copy()
    # Keep the three largest values as (toy) scores for "yes", "no", "maybe"
    weighted_sum = np.sort(weighted_sum)[::-1][:3]
    # Apply softmax to obtain a probability distribution
    probabilities = np.exp(weighted_sum) / np.sum(np.exp(weighted_sum))
    return history_next, probabilities

# Example usage
history = np.array([0.1, 0.2, 0.1, 1.4, 2])
seq = ['yes', 'no', 'no', 'yes', 'maybe']
for current_word in seq:
    history, probabilities = probability_model(history, current_word)
    print(f'Current word is "{current_word}"')
    print(f'Probabilities of next word being ["yes", "no", "maybe"] = {probabilities}')

The beauty of recurrent functions is that they can seamlessly process sequences of words of arbitrary lengths without any hindrance. Thus, we can set the length of the word sequence as desired, and the recurrent function will handle it smoothly. Feel free to experiment with the seq list in the provided code. However, it is important to note that the simplicity of the above recurrent function may not adequately capture the complex statistical patterns found in real-world language. To enhance the model’s representational capacity, one approach is to model it as a neural network. This is where the true strength of recurrent neural networks begins to emerge.

When a recurrent function is implemented using a neural network, it is referred to as a recurrent neural network (RNN).

The following code is a template of a recurrent neural network in the context of language modeling. Notice the similarity with the function defined earlier.

def probability_model(history, current_word):
    h = history
    x = embeddings[current_word]

    history_next, probabilities = neural_network(h,x)

    return history_next, probabilities

Generation using RNNs

The model’s parameters are optimized to make accurate predictions, enabling the generation of next-word sequences. This process involves substituting the current word in each subsequent function call with the word that was predicted to be the best choice in the previous call. The figure below provides a visual representation and further explanation of this concept.

RNNs for generating the next word
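
Assuming the toy snippet above has already been run (so numpy, probability_model, and the three-word vocabulary are in scope), the following sketch shows the feedback loop the figure describes: the word predicted at one step becomes the current word at the next step. Greedy selection is used here purely for illustration.

# A minimal generation loop using the toy model defined earlier
words = ["yes", "no", "maybe"]
current_word = "yes"
history = np.array([0.1, 0.2, 0.1, 1.4, 2])
generated = [current_word]

for _ in range(5):  # generate five more words
    history, probabilities = probability_model(history, current_word)
    # Greedily pick the most probable word and feed it back in
    current_word = words[int(np.argmax(probabilities))]
    generated.append(current_word)

print(" ".join(generated))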

Sampling the next word

The generation of the next word involves sampling from the learned probability distribution. The language model assigns probabilities to each potential next word given the preceding words in the sequence. Sampling can be done using various methods, such as greedy sampling (selecting the word with the highest probability), random selection of the next word, or advanced techniques like temperature-based sampling or beam search, which help control the diversity or quality of the generated text.

When using random or temperature-based sampling, it is possible to obtain different next words for the same history and current word on multiple occasions. Have you had the opportunity to try out ChatGPT yourself?
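
The sketch below illustrates temperature-based sampling over a hypothetical three-word distribution; the helper name and the probabilities are made up for the example.

import numpy as np

def sample_next_word(words, probabilities, temperature=1.0):
    # Rescale the distribution: temperature < 1 sharpens it (closer to greedy),
    # temperature > 1 flattens it (more diverse output)
    logits = np.log(np.asarray(probabilities)) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return np.random.choice(words, p=scaled)

words = ["yes", "no", "maybe"]    # hypothetical next-word candidates
probabilities = [0.6, 0.3, 0.1]   # probabilities assigned by the model

print(sample_next_word(words, probabilities, temperature=0.5))  # usually "yes"
print(sample_next_word(words, probabilities, temperature=2.0))  # more varied

Lower temperatures concentrate probability on the most likely word, while higher temperatures spread it toward less likely words, which is one reason generated text can differ across runs.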

Transformers

Despite being well-suited for language models, RNNs face two primary challenges. First, they rely on sequential processing through time, and second, they struggle with managing long-term dependencies in historical data.

Transformers, by contrast, process input sequences in parallel rather than one word at a time. This parallelism enables them to handle long-range dependencies more effectively and significantly speeds up computation, making them more efficient for training and inference.

While transformers offer significant advantages in practice, particularly in language modeling, it’s worth noting that RNNs still excel in certain scenarios. The choice between transformers and RNNs depends on the specific requirements and characteristics of the problem at hand.

Transformers are designed to transform a sequence of input embeddings $(\bold x_1, \dots, \bold x_n)$ into a sequence of transformed embeddings $(\bold y_1, \dots, \bold y_n)$ of the same length.

Self-attention

The essence of the transformer architecture lies in its use of self-attention, which plays a crucial role in the entire process. The idea is to model the history of the current word in a more sophisticated manner. Specifically, certain words within the current word’s history may carry greater significance than others when generating the next word. Self-attention achieves this by assigning weights to all the words in the history, ensuring that important words receive higher weights. The weights are non-negative and are typically rescaled so that they sum to 1.

How self-attention works

Consider a sequence of input embeddings $(\bold x_1, \bold x_2, \dots, \bold x_n)$ where each embedding $\bold x_i$ has $d$ components, i.e., $\bold x_i \in \mathbb{R}^d$ for all $i$. Let’s assume that the current word’s embedding is $\bold x_i$, and we aim to transform it to obtain $\bold y_i$. Using parameter matrices $W_q$, $W_k$, and $W_v$, we transform all the input embeddings up to the current word’s embedding as follows:

$$\begin{align*} \bold q_j &= W_q \bold x_j \\ \bold k_j &= W_k \bold x_j \\ \bold v_j &= W_v \bold x_j \end{align*}$$

The weight $a_{ij}$, which measures how much attention position $i$ pays to position $j$, can then be defined as follows:

$$a_{ij} = \bold q_i^T \bold k_j$$

Finally, $\bold y_i$ can be estimated as a linear combination of the transformed embeddings:

$$\bold y_i = \sum_{j=1}^{i} a_{ij} \bold v_j$$

The parameter matrices $W_q$, $W_k$, and $W_v$ are learned during the training process, allowing the weights $a_{ij}$ to be learned as well. Here, $a_{ij}$ represents the importance or attention given to the $j^{th}$ element when processing the $i^{th}$ element in the sequence. It quantifies the relevance or contribution of other elements to the current element’s representation, making self-attention a powerful mechanism for capturing dependencies and relationships within a sequence of data, such as in natural language processing tasks.

Forward-pass for attention
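
The following numpy sketch mirrors the equations above for a single attention head. The sizes and the random initialization are illustrative assumptions, and the attention weights are normalized with a softmax so that they are non-negative and sum to 1, as described earlier.

import numpy as np

np.random.seed(0)
n, d = 4, 5                   # sequence length and embedding dimension (toy sizes)
X = np.random.rand(n, d)      # input embeddings x_1, ..., x_n, one per row

W_q = np.random.rand(d, d)    # learnable parameter matrices
W_k = np.random.rand(d, d)
W_v = np.random.rand(d, d)

Q = X @ W_q.T                 # q_j = W_q x_j
K = X @ W_k.T                 # k_j = W_k x_j
V = X @ W_v.T                 # v_j = W_v x_j

Y = np.zeros_like(X)
for i in range(n):
    scores = Q[i] @ K[:i + 1].T              # a_ij = q_i . k_j for j <= i
    weights = np.exp(scores - scores.max())  # softmax: non-negative weights
    weights /= weights.sum()                 # that sum to 1
    Y[i] = weights @ V[:i + 1]               # y_i = sum_j a_ij v_j

print(Y)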

Positional encoding

It is important to note that shuffling the input embeddings would yield the same transformations in self-attention due to the properties of linear combinations. However, this shuffling disregards the sequential nature of the words and fails to utilize their inherent order. To preserve and incorporate sequential information, it is essential to include positional information within each word’s corresponding input embedding. By incorporating positional encoding, the model becomes aware of the relative positions of words in the sequence, enabling it to leverage and utilize the sequential relationships for more accurate processing and understanding of the input.
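
The blog does not commit to a particular encoding scheme, but one common choice, introduced in the original transformer paper, is sinusoidal positional encoding. The sketch below is illustrative; the sequence length and embedding dimension are arbitrary.

import numpy as np

def positional_encoding(num_positions, d):
    # Each position gets a d-dimensional sinusoidal pattern
    positions = np.arange(num_positions)[:, np.newaxis]  # (num_positions, 1)
    dims = np.arange(d)[np.newaxis, :]                   # (1, d)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d)
    angles = positions * angle_rates
    encoding = np.zeros((num_positions, d))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])  # even dimensions use sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])  # odd dimensions use cosine
    return encoding

word_embeddings = np.random.rand(4, 8)  # 4 words, d = 8 (toy sizes)
inputs_with_position = word_embeddings + positional_encoding(4, 8)
print(inputs_with_position.shape)       # (4, 8)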

Training a transformer

The training process of a transformer is fundamentally similar to that of RNNs. With self-attention, however, the key distinction lies in the parallel nature of the computations: unlike RNNs, which process the input one word at a time, transformers can compute over all positions in parallel, allowing for more efficient training.

Once trained, generating the next word in a transformer follows a similar approach as in RNNs. The process involves sampling the next word and using it as the current word for the subsequent time step. By iteratively generating words in this manner, the transformer produces a sequence of words that can extend beyond the training data.

Generative Pre-trained Transformer (GPT)

In the context of transformers, generative refers to the ability of the model to generate new content, such as text, based on its understanding of the patterns and structure in the training data. Generative models aim to produce outputs that resemble and extend beyond the data they were trained on.

In the case of transformer models like GPT (Generative Pre-trained Transformer), the term “generative” indicates that the model is capable of generating coherent and contextually relevant text. By leveraging its learned knowledge of language patterns and relationships, a generative transformer can generate sequences of words that are meaningful and resemble human-like language.

Generative transformers have shown impressive capabilities in various natural language processing tasks, including text completion, text generation, machine translation, and more. They have the ability to generate novel and contextually appropriate responses, making them valuable tools in applications such as chatbots, content generation, and creative writing assistance.

The OpenAI GPT series

In recent years, the field of natural language processing (NLP) has witnessed a revolutionary advancement with the emergence of Generative Pre-trained Transformers (GPTs). These models, which combine the power of transformers and generative capabilities, have transformed the landscape of language understanding and generation tasks. In this blog, we will delve into the evolution of GPTs, exploring their remarkable journey from GPT-1 to the cutting-edge models of today.

GPT-1

GPT-1, the first iteration of the GPT series introduced by OpenAI in 2018, set the stage for what was to come. Built upon the transformer architecture, GPT-1 showcased the potential of self-attention mechanisms in capturing contextual dependencies in text data. Despite limitations in precise control and context consistency, GPT-1 demonstrated impressive language generation capabilities, igniting excitement for further advancements.

GPT-2

Building upon the success of GPT-1, OpenAI released GPT-2 in 2019, elevating the capabilities of generative models. GPT-2 boasted a significantly larger model size and was trained on a vast corpus of text data, enabling it to generate high-quality text with remarkable coherence and fluency. The model gained considerable attention due to its ability to generate human-like articles, poetry, and even news reports. However, due to concerns about potential misuse, OpenAI initially refrained from releasing the full model.

GPT-3

The arrival of GPT-3 marked a monumental leap forward in generative language models. Released in 2020, GPT-3 stunned the NLP community and beyond with its unprecedented scale and capabilities. With a staggering 175 billion parameters, GPT-3 achieved remarkable proficiency in language understanding and generation tasks. The model demonstrated exceptional performance in a wide range of applications, including language translation, question answering, text completion, and even creative writing. GPT-3 exhibited the ability to engage in coherent and contextually relevant conversations, showcasing the power of large-scale generative models.

Beyond GPT-3

The evolution of GPTs does not stop at GPT-3. The NLP community is continuing to push the boundaries of generative transformers. Researchers and organizations are actively exploring ways to improve efficiency, fine-tune control, and address ethical concerns. Techniques like prompt engineering, few-shot learning, and model distillation are employed to enhance the GPT models’ capabilities while addressing their limitations.

Conclusion

The evolution of GPTs has revolutionized the field of natural language processing, pushing the limits of language understanding and generation. From the pioneering GPT-1 to the groundbreaking GPT-3, these Generative Pre-trained Transformers have paved the way for incredible advancements. The continuous innovation and exploration in this area promise to unlock even more transformative possibilities, leading us toward a future where generative models play an increasingly significant role in understanding, communicating, and generating human-like text.


You can learn more about natural language processing in our courses:

Using OpenAI API for Natural Language Processing in Python

As consumers rely more and more on search engines and technical software programs to answer their questions, the demand for effective and scalable natural language processing has increased immensely. OpenAI provides access to the GPT model, which can perform several NLP-related tasks such as summarization, classification, text completion, text insertion, and more. In this course, you’ll learn about the various endpoints of the OpenAI API and how they can be used to accomplish certain NLP tasks. You’ll also look at examples of each endpoint to show how they work. By the time you’re done with this course, you’ll be able to work on your own projects using the OpenAI API.


Natural Language Processing with Machine Learning

In this course, you'll learn techniques for processing text data, creating word embeddings, and using long short-term memory networks (LSTMs) for tasks such as semantic analysis and machine translation. After completing this course, you will be able to solve the important day-to-day NLP problems faced in industry, which is incredibly useful given the prevalence of text data. The code for this course is built around the TensorFlow framework, one of the premier frameworks for industry machine learning, and the Python pandas library for data analysis. Knowledge of Python and TensorFlow is a prerequisite. This course was created by AdaptiLab, a company specializing in evaluating, sourcing, and upskilling enterprise machine learning talent. It is built in collaboration with industry machine learning experts from Google, Microsoft, Amazon, and Apple.
