
The evolution of GPTs

Nov 15, 2023

Key takeaways:

  1. Language models are designed to understand and generate human-like text based on the patterns they learn from vast datasets.

  2. Training language models involves processing large amounts of text data to learn word relationships and context.

  3. Text generation refers to the ability of models to produce coherent and contextually relevant text based on input prompts.

  4. GPT models are a type of transformer specializing in generating human-like text through extensive pre-training on diverse text data.

  5. GPT-1 introduced the concept of a generative pre-trained transformer, laying the groundwork for future advancements in language modeling.

  6. GPT-2 expanded on its predecessor by significantly increasing the model size and demonstrating impressive text generation abilities.

  7. GPT-3 further pushed the boundaries with 175 billion parameters, enabling it to generate even more coherent and contextually rich text.

The buzz surrounding GPT is everywhere—on social media, in classrooms, and in boardrooms. Everyone is talking about its remarkable capabilities. But what’s all the excitement about?

This blog is an attempt to unfold the historical developments within computational linguistics that helped give birth to ChatGPT. Like us humans, do computers also need to know what words mean?

How machines understand words#

Fluent speakers possess vast knowledge, primarily reflected in their vocabulary. This knowledge includes the grammatical function, meaning, real-world reference, and pragmatic use of words. While estimates of adult vocabulary size vary, it is generally agreed that most words used by mature speakers are acquired early in life through spoken interactions. This active vocabulary remains limited compared to the adult vocabulary, leaving many words to be acquired through other means. Children achieve remarkable vocabulary growth rates by learning approximately 7 to 10 words daily, with reading playing a significant role in this process.

A renowned principle within linguistics, known as the distributional hypothesis, suggests that word meanings can be learned from text alone, based on the associations between words and their co-occurring words. This works because words with similar meanings often appear in similar contexts or alongside similar words in written text. A class of machine learning models called language models has proven effective in capturing this type of knowledge from vast amounts of text. These language models can be used for various natural language processing tasks, including sentiment analysis, machine translation, text summarization, chatbot Q&A, and text generation.

This blog will center around the basics of language models that use a specific architecture and provide a foundation for understanding the much-hyped GPT (Generative Pre-trained Transformer). Subsequently, we will explore the evolution of GPT models as it happened over the years.

Language models#

In the context of natural language processing (NLP), a language model is a computational model designed to understand and generate human language. It is trained on a large corpus of text data and learns the language’s statistical patterns, relationships, and structures. The main goal of a language model is to predict the probability of the next word in a sequence of words, given the previous words. In particular, if $w_1, w_2, \dots, w_n$ is a sequence of $n$ words, then the probability of the next word, $w_{n+1}$, follows from the definition of conditional probability:

$$P(w_{n+1} \mid w_n, w_{n-1}, \dots, w_1) = \frac{P(w_{n+1}, w_n, w_{n-1}, \dots, w_1)}{P(w_n, w_{n-1}, \dots, w_1)}$$

Training language models#

Generating the next word based on a sequence of words becomes a matter of sampling from the probability distribution once the language model is accessible.

However, the initial challenge lies in estimating the probability model itself. Constructing such a language model necessitates an adequate amount of training data, which, fortunately, is readily available nowadays.

One simple way to prepare training data is to split the text into n-grams. An n-gram is a chunk of consecutive words, where ‘n’ represents the number of words in each chunk. Imagine you have a sentence like “I love ice cream.” If we break this into 2-grams (also known as bigrams), we get the following pairs of words: “I love,” “love ice,” and “ice cream.” For 3-grams (trigrams), we have groups of three words: “I love ice” and “love ice cream.” The following code snippet demonstrates the implementation of the n-gram approach.

# Creating a function to generate n-grams
def make_training_samples(text, context_length):
    words = text.split()
    output = []
    # Iterate through the text to generate n-grams
    for i in range(len(words) - context_length + 1):
        output.append(words[i:i + context_length])
    return output

# Calling the function
text = 'How old are you right now ?'
# Generate training samples for different context lengths
for context_length in range(2, len(text.split()) + 1):
    print(f'Context Length = {context_length - 1}')
    # Generate training samples using the make_training_samples function
    samples = make_training_samples(text, context_length=context_length)
    # Print each training sample
    for sample in samples:
        print(sample)

After preparing the training data from a substantial corpus of text, the next step is to estimate the probability model, which is referred to as language model training. This process involves training the language model on the prepared dataset, allowing it to learn the statistical patterns and relationships in the language.
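One simple way to estimate such a probability model is by counting. The sketch below is only an illustration under stated assumptions: it uses a bigram model (a context of one word), a toy corpus made up for this example, and a hypothetical helper named `train_bigram_model`. It estimates the probability of a next word as the number of times the pair occurs divided by the number of times the current word occurs as the first word of a pair.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Estimate P(next_word | current_word) from raw pair counts."""
    words = text.split()
    pair_counts = defaultdict(Counter)
    # Count how often each word follows each other word
    for current_word, next_word in zip(words, words[1:]):
        pair_counts[current_word][next_word] += 1
    # Normalize the counts for each context word so they sum to 1
    model = {}
    for current_word, followers in pair_counts.items():
        total = sum(followers.values())
        model[current_word] = {w: c / total for w, c in followers.items()}
    return model

# Toy corpus (made up for illustration)
corpus = "I love ice cream . I love pizza . I eat ice cream ."
model = train_bigram_model(corpus)
print(model["I"])    # "love" gets 2/3 and "eat" gets 1/3
print(model["ice"])  # "cream" gets probability 1.0
```

Counting works for short contexts, but the number of possible contexts grows rapidly with the context length, which is one reason learned models are used for longer histories.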

Language models and LLMs: Scaling intelligence#

Language models, like the ones discussed earlier, have laid the foundation for natural language processing tasks. But as data grew, so did the need for more powerful models capable of understanding context at a much deeper level. This is where large language models (LLMs) come in.

LLMs are essentially language models, but at a massive scale—trained on billions, even trillions, of words. Their sheer size enables them to capture subtle nuances in language, making them highly effective at generating coherent, meaningful text across various tasks. Whether summarizing complex documents, engaging in conversation, or answering intricate questions, LLMs have transformed AI’s ability to process and generate human-like text.

By leveraging their vast knowledge, LLMs are the next evolution of language models, pushing the boundaries of what AI can achieve in understanding and communication. GPT is a type of LLM.


Text generation#

Once the language model is trained in this way on a text corpus, it gains the ability to sample from the learned probability distribution. This sampling process enables the language model to generate coherent and contextually relevant text, as it can draw upon its learned knowledge of language patterns and structures.

But wait a minute!#

Before delving further into the details of how language models generate text, some questions need to be asked:

  • How is a word represented in a language model?
  • How is the estimation of probabilities conducted using prepared training data?
  • Which models can handle word sequences of any length?
  • How is the generation of the next word accomplished, and what does the sampling process entail?

Let’s address these questions individually and provide a comprehensive explanation for each one.

Word embeddings#

In computational models, words are typically represented using numerical vectors or embeddings, instead of strings or characters. These embeddings capture the semantic and syntactic properties of words, allowing them to be processed mathematically.

Learning the optimal word embedding representation can be treated as a standalone task, but more commonly, it is learned concurrently with the estimation of the probability model for the problem being solved.

Fun fact: Word2Vec is a popular word embedding technique developed by Google.

Buy one, get one free!#

Let’s consider a word $w_i$ that is represented by a vector of $d$ numbers:

$$w_i = (v_1, v_2, \dots, v_d)$$

In this representation, $v_1, v_2, \dots, v_d$ are additional learnable parameters that are updated alongside the parameters of the probability model. These embedding parameters are adjusted to enhance the accuracy of predictions. Typically, the parameters are initialized randomly unless a more sophisticated initialization method is employed.
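As a rough sketch of this idea, the embeddings of a whole vocabulary can be stored as the rows of one randomly initialized matrix, with a lookup by word index. The three-word vocabulary, the dimension `d = 4`, and the names `embedding_matrix` and `embed` are assumptions chosen for illustration only.

```python
import numpy as np

vocabulary = {"yes": 0, "no": 1, "maybe": 2}  # word -> row index (toy vocabulary)
d = 4  # embedding dimension (chosen arbitrarily for illustration)

# One row of d learnable numbers per word, initialized randomly
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(vocabulary), d))

def embed(word):
    """Look up the embedding vector (v_1, ..., v_d) of a word."""
    return embedding_matrix[vocabulary[word]]

print(embed("maybe").shape)  # (4,)
```

During training, the entries of `embedding_matrix` would be updated together with the other model parameters; here they simply stay at their random initial values.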

The probability model#

The probability model used for predicting the next word needs to possess two essential capabilities:

  • It should take into account all the words in the sequence that have been processed thus far as input.
  • It should generate the probability of each potential next word as output.

While various models can be employed in theory, the most commonly used approach is to utilize recurrent functions. These functions process one word at a time, incorporating information from all previously processed words (history) as input. In doing so, they provide information about all the words processed so far and a probability distribution for the next word.

The initial history for the first word can be arbitrarily assigned or set randomly.

Example#

Let’s look at the following code, which builds a simple language model that predicts the next word in a sequence based on the history of words. It uses a small vocabulary of three words, “yes,” “no,” and “maybe,” each represented by a numerical embedding. The model has three matrices of learnable parameters: $W_h$ and $W_x$ assign weights to the historical context and the current word, respectively, and $W_y$ maps the updated history to one score per vocabulary word. The probability_model function computes the updated history and the probabilities of the next word being “yes,” “no,” or “maybe.” The example then iterates over a given sequence, updating the history at each step. Because the parameters are random and untrained, the printed probabilities are arbitrary; the code only illustrates the mechanics.

import numpy as np

# Vocabulary: maps each word to its position in the output distribution
vocabulary = {"yes": 0, "no": 1, "maybe": 2}

# Embeddings (fixed here for simplicity; learned in practice)
embeddings = {
    "yes": np.array([0.1, 0.2, 0.3, 0.4, 0.5]),
    "no": np.array([0.6, 0.7, 0.8, 0.9, 1.0]),
    "maybe": np.array([1.1, 1.2, 1.3, 1.4, 1.5]),
}

# Learnable parameters (randomly initialized, untrained)
Wh = np.random.rand(5, 5)
Wx = np.random.rand(5, 5)
Wy = np.random.rand(len(vocabulary), 5)

# Probability model function
def probability_model(history, current_word):
    # Retrieve the history and the embedding of the current word
    h_embedding = history
    x_embedding = embeddings[current_word]
    # Combine history and current word; tanh keeps the values bounded
    history_next = np.tanh(np.dot(Wh, h_embedding) + np.dot(Wx, x_embedding))
    # Compute one score per vocabulary word
    scores = np.dot(Wy, history_next)
    # Apply softmax to obtain a probability distribution
    exp_scores = np.exp(scores - np.max(scores))
    probabilities = exp_scores / np.sum(exp_scores)
    return history_next, probabilities

# Example usage
history = np.array([0.1, 0.2, 0.1, 1.4, 2])
seq = ['yes', 'no', 'no', 'yes', 'maybe']
for current_word in seq:
    history, probabilities = probability_model(history, current_word)
    print(f'Current word is "{current_word}"')
    print(f'Probabilities of next word being ["yes", "no", "maybe"] = {probabilities}')

The beauty of recurrent functions is that they can seamlessly process sequences of words of arbitrary lengths without any hindrance. Thus, we can set the length of the word sequence as desired, and the recurrent function will handle it smoothly. Feel free to experiment with the seq list in the provided code. However, it is important to note that the simplicity of the above recurrent function may not adequately capture the complex statistical patterns found in real-world language. To enhance the model’s representational capacity, one approach is to model it as a neural network. This is where the true strength of recurrent neural networks begins to emerge.

When a recurrent function is implemented using a neural network, it is referred to as a recurrent neural network (RNN).

The following code is a template of a recurrent neural network in the context of language modeling. Notice the similarity with the function defined earlier.

def probability_model(history, current_word):
    h = history
    x = embeddings[current_word]

    history_next, probabilities = neural_network(h,x)

    return history_next, probabilities

Generation using RNNs#

The model’s parameters are optimized to make accurate predictions, enabling the generation of next-word sequences. This process involves substituting the current word in each subsequent function call with the word that was predicted to be the best choice in the previous call. The figure below provides a visual representation and further explanation of this concept.

RNNs for generating the next word
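The feedback loop described above can be sketched as follows. This is a minimal illustration, not a trained model: the parameters are random, the three-word vocabulary is a toy, the recurrent step is a simple tanh layer chosen as an assumption, and the function name `generate` is hypothetical. At each step, the most probable word is fed back in as the next current word.

```python
import numpy as np

words = ["yes", "no", "maybe"]
rng = np.random.default_rng(seed=0)
d = 5
embeddings = {w: rng.normal(size=d) for w in words}
# Untrained, randomly initialized parameters (a trained model would learn these)
Wh = rng.normal(size=(d, d))
Wx = rng.normal(size=(d, d))
Wy = rng.normal(size=(len(words), d))

def probability_model(history, current_word):
    """One recurrent step: update the history and score every next word."""
    history_next = np.tanh(Wh @ history + Wx @ embeddings[current_word])
    scores = Wy @ history_next
    exp_scores = np.exp(scores - scores.max())  # numerically stable softmax
    return history_next, exp_scores / exp_scores.sum()

def generate(first_word, num_words):
    """Greedy generation: feed each predicted word back in as the next input."""
    history = np.zeros(d)
    current_word = first_word
    generated = [first_word]
    for _ in range(num_words):
        history, probabilities = probability_model(history, current_word)
        current_word = words[int(np.argmax(probabilities))]
        generated.append(current_word)
    return generated

print(generate("yes", num_words=5))
```

Since the model is untrained, the generated sequence is meaningless; the point is only that each output becomes the next input.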

Sampling the next word#

The generation of the next word involves sampling from the learned probability distribution. The language model assigns a probability to each potential next word given the preceding words in the sequence. The next word can then be chosen in various ways, such as greedy selection (always picking the word with the highest probability), random sampling (drawing a word at random in proportion to its probability), or more advanced techniques like temperature-based sampling, which controls how diverse the generated text is, and beam search, which keeps track of several candidate sequences to improve the overall quality of the generated text.

When using random or temperature-based sampling, it is possible to obtain different next words for the same history and current word on multiple occasions. Have you had the opportunity to try out ChatGPT yourself?
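To make the difference concrete, here is a small sketch of greedy selection and temperature-based sampling. The probability values are made up, the function names are hypothetical, and the temperature is applied by raising each probability to the power 1/T and renormalizing, which assumes all probabilities are positive (this is equivalent to dividing the underlying scores by T before the softmax).

```python
import math
import random

def greedy_choice(probabilities):
    """Pick the word with the highest probability."""
    return max(probabilities, key=probabilities.get)

def apply_temperature(probabilities, temperature):
    """Rescale a distribution: T < 1 sharpens it, T > 1 flattens it."""
    scaled = {w: math.exp(math.log(p) / temperature)
              for w, p in probabilities.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}

def sample_choice(probabilities, temperature=1.0, rng=random):
    """Draw a word at random in proportion to its rescaled probability."""
    rescaled = apply_temperature(probabilities, temperature)
    chosen = rng.choices(list(rescaled), weights=list(rescaled.values()), k=1)
    return chosen[0]

next_word_probs = {"yes": 0.6, "no": 0.3, "maybe": 0.1}  # made-up distribution
print(greedy_choice(next_word_probs))                   # yes
print(apply_temperature(next_word_probs, 0.5))          # "yes" gets more weight
print(sample_choice(next_word_probs, temperature=1.5))  # varies from run to run
```

Greedy selection always returns the same word for the same distribution, whereas the sampled choice can change between calls.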

Transformers#

Despite being well-suited for language models, RNNs face two primary challenges. First, they rely on sequential processing through time, and second, they struggle with managing long-term dependencies in historical data.

Transformers address both issues. Unlike RNNs, which operate sequentially, transformers can process all positions of an input sequence in parallel, which significantly speeds up computation, particularly during training. In addition, their self-attention mechanism connects each word directly to every word in its history, which helps them handle long-range dependencies more effectively.

While transformers offer significant advantages in practice, particularly in language modeling, it’s worth noting that RNNs still excel in certain scenarios. The choice between transformers and RNNs depends on the specific requirements and characteristics of the problem at hand.

Transformers are designed to transform sequences of input embeddings $(\mathbf{x}_1, \dots, \mathbf{x}_n)$ into sequences of transformed embeddings $(\mathbf{y}_1, \dots, \mathbf{y}_n)$ with the same length.

Self-attention#

The essence of the transformer architecture lies in its utilization of self-attention, which plays a crucial role in the entire process. The concept revolves around modeling the history of the current word in a more advanced manner. Specifically, certain words within the current word’s history may carry greater significance than others when generating the next word. Self-attention achieves this by assigning weights to all the words in the history, ensuring that important words receive higher weights. It is typical to rescale the weights to sum up to 1, and all weights are non-negative.

How self-attention works#

Consider a sequence of input embeddings $(\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n)$ where each embedding $\mathbf{x}_i$ has $d$ components, i.e., $\mathbf{x}_i \in \mathbb{R}^d$ for all $i$. Let’s assume that the current word’s embedding is $\mathbf{x}_i$, and we aim to transform it to obtain $\mathbf{y}_i$. Using parameter matrices $W_q$, $W_k$, and $W_v$, we transform all the input embeddings up to the current word’s embedding, i.e., for $j = 1, \dots, i$, as follows:

$$\begin{aligned} \mathbf{q}_j &= W_q\mathbf{x}_j \\ \mathbf{k}_j &= W_k\mathbf{x}_j \\ \mathbf{v}_j &= W_v\mathbf{x}_j \end{aligned}$$

The weight $a_{ij}$ compares the query of the current word with the key of the $j^{th}$ word and can be defined as follows:

$$a_{ij} = \mathbf{q}_i^T\mathbf{k}_j$$

In practice, these scores are rescaled (typically with a softmax over $j$) so that they are non-negative and sum to 1, as noted above.

Finally, $\mathbf{y}_i$ can be estimated as a linear combination of the transformed embeddings:

$$\mathbf{y}_i = \sum_{j=1}^{i} a_{ij}\mathbf{v}_j$$

The parameter matrices, $W_q$, $W_k$, and $W_v$, are learned during the training process, allowing the weight $a_{ij}$ to be learned as well. Here, $a_{ij}$ represents the importance or attention given to the $j^{th}$ element when processing the $i^{th}$ element in the sequence. It quantifies the relevance or contribution of other elements to the current element’s representation, making self-attention a powerful mechanism for capturing dependencies and relationships within a sequence of data, such as in natural language processing tasks.
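The computation can be sketched in a few lines of NumPy. This is an illustration under assumptions rather than a reference implementation: it follows the common convention in which the weight for position $j$ compares the query of the current word $i$ with the key of word $j$, scales the scores by the square root of the key dimension, and normalizes them with a softmax. The function name `causal_self_attention` and the random inputs are hypothetical.

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Transform rows x_1..x_n of X into y_1..y_n, each attending to j <= i."""
    Q = X @ Wq.T  # row j is q_j = Wq x_j
    K = X @ Wk.T  # row j is k_j = Wk x_j
    V = X @ Wv.T  # row j is v_j = Wv x_j
    scores = Q @ K.T / np.sqrt(K.shape[1])  # scores[i, j] = q_i . k_j / sqrt(d)
    # Mask out future positions (j > i) so each word sees only its history
    n = X.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Softmax over j: weights are non-negative and each row sums to 1
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights  # y_i = sum_j a_ij v_j

rng = np.random.default_rng(seed=0)
n, d = 4, 3
X = rng.normal(size=(n, d))  # stand-in for n input embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Y, A = causal_self_attention(X, Wq, Wk, Wv)
print(Y.shape)        # (4, 3)
print(A.sum(axis=1))  # each row of weights sums to 1
```

Note that all rows of `Y` are computed with a few matrix products, without looping over positions one at a time.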

Forward-pass for attention

Positional encoding#

It is important to note that self-attention on its own is insensitive to word order: the weights depend only on the content of the embeddings, so shuffling the words in the history would leave the linear combination, and hence the transformation of the current word, unchanged. Such a model disregards the sequential nature of the words and fails to utilize their inherent order. To preserve and incorporate sequential information, it is essential to include positional information within each word’s corresponding input embedding. With positional encoding, the model becomes aware of the positions of words in the sequence, enabling it to use the sequential relationships for more accurate processing and understanding of the input.
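There are several ways to encode positions, and the text above does not commit to one. As an assumption for illustration, the sketch below uses the fixed sinusoidal scheme from the original transformer paper, in which each position gets a vector of sines and cosines at different frequencies that is added to the word embedding. The function name and the random stand-in embeddings are hypothetical.

```python
import numpy as np

def sinusoidal_positional_encoding(num_positions, d):
    """Return a (num_positions, d) matrix of position vectors (d must be even)."""
    positions = np.arange(num_positions)[:, None]            # shape (n, 1)
    frequencies = 1.0 / (10000 ** (np.arange(0, d, 2) / d))  # shape (d/2,)
    angles = positions * frequencies                         # shape (n, d/2)
    encoding = np.zeros((num_positions, d))
    encoding[:, 0::2] = np.sin(angles)  # even components
    encoding[:, 1::2] = np.cos(angles)  # odd components
    return encoding

rng = np.random.default_rng(seed=0)
n, d = 6, 8
word_embeddings = rng.normal(size=(n, d))  # stand-in for learned embeddings
inputs = word_embeddings + sinusoidal_positional_encoding(n, d)
print(inputs.shape)  # (6, 8)
```

Because every position receives a different vector, the same word at two different positions produces two different inputs to self-attention.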


Training a transformer#

The training process of a transformer is fundamentally similar to that of RNNs. The key distinction, made possible by self-attention, lies in the parallel nature of the computations. Unlike RNNs, which must process the input one word at a time, transformers can compute the outputs for all positions of a training sequence in parallel, allowing for more efficient training.

Once trained, generating the next word in a transformer follows a similar approach as in RNNs. The process involves sampling the next word and using it as the current word for the subsequent time step. By iteratively generating words in this manner, the transformer produces a sequence of words that can extend beyond the training data.


The OpenAI GPT series#

In recent years, the field of natural language processing (NLP) has witnessed a revolutionary advancement with the emergence of Generative Pre-trained Transformers (GPTs). These models, which combine the power of transformers and generative capabilities, have transformed the landscape of language understanding and generation tasks. The rest of this blog traces the evolution of GPTs, from GPT-1 to the cutting-edge models of today.

Generative Pre-trained Transformer (GPT)#

In the context of transformers, generative refers to the ability of the model to generate new content, such as text, based on its understanding of the patterns and structure in the training data. Generative models aim to produce outputs that resemble and extend beyond the data they were trained on.

In the case of transformer models like GPT (Generative Pre-trained Transformer), the term “generative” indicates that the model is capable of generating coherent and contextually relevant text. By leveraging its learned knowledge of language patterns and relationships, a generative transformer can generate sequences of words that are meaningful and resemble human language. The term “pre-trained” indicates that the model is first trained on a large corpus of unlabeled text to predict the next word and can afterward be adapted to specific tasks.

Generative transformers have shown impressive capabilities in various natural language processing tasks, including text completion, text generation, machine translation, and more. They have the ability to generate novel and contextually appropriate responses, making them valuable tools in applications such as chatbots, content generation, and creative writing assistance.


GPT-1#

GPT-1, the first iteration of the GPT series, was introduced by OpenAI in 2018 and set the stage for what was to come. Built upon the transformer architecture, GPT-1 showcased the potential of self-attention mechanisms in capturing contextual dependencies in text data. Despite limitations in precise control and context consistency, GPT-1 demonstrated impressive language generation capabilities, igniting excitement for further advancements.

GPT-2#

Building upon the success of GPT-1, OpenAI released GPT-2 in 2019, elevating the capabilities of generative models. With 1.5 billion parameters, GPT-2 was more than ten times larger than GPT-1 (about 117 million parameters) and was trained on a vast corpus of text data, enabling it to generate high-quality text with remarkable coherence and fluency. The model gained considerable attention due to its ability to generate human-like articles, poetry, and even news reports. However, due to concerns about potential misuse, OpenAI initially refrained from releasing the full model.

GPT-3#

The arrival of GPT-3 marked a monumental leap forward in generative language models. Released in 2020, GPT-3 stunned the NLP community and beyond with its unprecedented scale and capabilities. With a staggering 175 billion parameters, GPT-3 achieved remarkable proficiency in language understanding and generation tasks. The model demonstrated exceptional performance in a wide range of applications, including language translation, question answering, text completion, and even creative writing. GPT-3 exhibited the ability to engage in coherent and contextually relevant conversations, showcasing the power of large-scale generative models.

Future of GPT models#

The evolution of GPTs does not stop at GPT-3. The NLP community is continuing to push the boundaries of generative transformers. Researchers and organizations are actively exploring ways to improve efficiency, fine-tune control, and address ethical concerns. Techniques like prompt engineering, few-shot learning, and model distillation are employed to enhance the GPT models’ capabilities while addressing their limitations.

Fun fact: GPT-4o can generate different creative text formats, such as poems, code, scripts, musical pieces, emails, and letters. This versatility showcases its ability to understand and apply various language styles and structures. Here is a poem generated by GPT-4o when given the prompt “generate a poem on e-learning”:

In pixels bright, where knowledge flows,
A world of learning gently grows.
No classroom walls, no rigid space,
Just open minds in virtual grace.

With clicks and keys, we soar so high,
Across the globe, beneath one sky.
From science deep to art’s pure gleam,
E-learning fuels each daring dream.

No waiting bell, no crowded hall,
Just knowledge ready, free for all.
A course, a screen, a curious mind.
In digital worlds, new paths we find.

So here we sit, with lessons clear.
In every byte, the future’s near.
For e-learning's not just a trend.
But how all journeys now transcend.


Conclusion#

The evolution of GPTs has revolutionized the field of natural language processing, pushing the limits of language understanding and generation. From the pioneering GPT-1 to the groundbreaking GPT-3, these Generative Pre-trained Transformers have paved the way for incredible advancements. The continuous innovation and exploration in this area promise to unlock even more transformative possibilities, leading us toward a future where generative models play an increasingly significant role in understanding, communicating, and generating human-like text.

Frequently Asked Questions

What is the evolution of GPT-3 to GPT-4?

GPT-4 improved upon GPT-3 by enhancing understanding of context, reducing biases, and supporting multimodal inputs (text and images). It offers more coherent and nuanced responses and has a larger training dataset.


