The Generative Pre-Trained Transformer: GPT-3
Learn about training generative pre-trained transformers.
The name GPT-3 stands for “Generative Pre-trained Transformer 3.” Let’s go through all these terms individually to understand the making of GPT-3.
Generative models
GPT-3 is a generative model because it generates text. Generative modeling is a branch of statistical modeling. It is a method for mathematically approximating the world. We are surrounded by an incredible amount of easily accessible information—both in the physical and digital worlds. The tricky part is to develop intelligent models and algorithms that can analyze and understand this treasure trove of data. Generative models are one of the most promising approaches to achieving this goal.
To train a model, we must prepare and preprocess a dataset, a collection of examples that helps the model learn to perform a given task. Usually, a dataset is a large amount of data in some specific domain, like using millions of images of cars to teach a model what a car is. Datasets can also take the form of sentences or audio samples. After the model has seen many such examples, we train it to generate similar data of its own.
Pre-trained models
Have you heard of the theory of 10,000 hours? In his book Outliers, Malcolm Gladwell suggested that practicing any skill for 10,000 hours is sufficient to make you an expert. This expert knowledge is reflected in the connections our human brains develop between their neurons. An AI model does something similar.
Training
To create a model that performs well, we need to train it; that is, we need to find good values for a specific set of internal variables called parameters. The process of determining the ideal parameters for our model is called training. The model assimilates parameter values through successive training iterations.
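To make parameters and training iterations concrete, here is a minimal sketch in PyTorch (the framework choice and the toy data are illustrative assumptions, not how GPT-3 is trained). It repeatedly adjusts a tiny model's two parameters until its predictions match the examples.

```python
import torch
from torch import nn

# Toy data: learn y = 2x + 1 from a handful of examples.
x = torch.tensor([[0.0], [1.0], [2.0], [3.0]])
y = torch.tensor([[1.0], [3.0], [5.0], [7.0]])

model = nn.Linear(1, 1)          # a tiny model with two parameters: a weight and a bias
loss_fn = nn.MSELoss()           # measures how far predictions are from the targets
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)

for step in range(500):          # successive training iterations
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)  # how wrong the current parameter values are
    loss.backward()              # compute how to nudge each parameter
    optimizer.step()             # update the parameters

print(model.weight.item(), model.bias.item())  # approaches 2 and 1
```

The same loop, scaled up to billions of parameters and trillions of words, is what makes training a model like GPT-3 so time-consuming.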
A deep learning model takes a lot of time to find these ideal parameters. Training is a lengthy process that, depending on the task, can last from a few hours to a few months and requires tremendous computing power. Reusing some of that long learning process for other tasks would significantly help. And this is where the pre-trained models come in.
A pre-trained model, keeping with Gladwell’s 10,000 hours theory, is the first skill we develop to help us acquire another faster. For example, mastering the craft of solving math problems can allow us to acquire the skill of solving engineering problems faster. A pre-trained model is trained (by us or someone else) for a more general task and can be fine-tuned for different tasks. Instead of creating a brand new model to address our issue, we can use a pre-trained model that has already been trained on a more general problem. The pre-trained model can be fine-tuned to address our specific needs by providing additional training with a tailored dataset. This approach is faster and more efficient and allows for improved performance compared to building a model from scratch.
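As a concrete illustration of fine-tuning, the sketch below adapts a small pre-trained language model to a sentiment-classification task using the Hugging Face transformers and datasets libraries. The model, dataset, and hyperparameters here are illustrative assumptions, not what OpenAI used for GPT-3.

```python
# A sketch of fine-tuning a pre-trained model on a tailored dataset.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # a small pre-trained model (illustrative choice)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A small slice of a labeled dataset tailored to the new task.
train_data = load_dataset("imdb", split="train[:2000]").shuffle(seed=42)
train_data = train_data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, padding="max_length"),
    batched=True,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="finetuned-sentiment",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=train_data,
)
trainer.train()  # additional training starts from the pre-trained weights
```

Because the model starts from weights learned on a general corpus, this additional training needs far less data and compute than training from scratch.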
Training dataset
In machine learning, a model is trained on a dataset. The size and type of data samples vary depending on the task we want to solve. GPT-3 is pre-trained on a corpus of text from five datasets: Common Crawl, WebText2, Books1, Books2, and Wikipedia.
Common Crawl
The Common Crawl corpus comprises petabytes of data, including raw web page data, metadata, and text data collected over eight years of web crawling. OpenAI researchers use a curated, filtered version of this dataset.
WebText2
WebText2 is an expanded version of the WebText dataset, an internal OpenAI corpus created by scraping particularly high-quality web pages. To vet for quality, the authors scraped all outbound links from Reddit posts that received at least three karma (an indicator of whether other users found the link interesting, educational, or just funny). WebText contains 40 gigabytes of text drawn from these 45 million links, spanning over 8 million documents.
Books1 and Books2
Books1 and Books2 are two corpora, or collections of text, that contain the text of tens of thousands of books on various subjects.
Wikipedia
This collection includes the English-language articles from the crowdsourced online encyclopedia Wikipedia at the time the GPT-3 dataset was finalized in 2019, roughly 5.8 million articles in total.
This corpus includes nearly a trillion words altogether.
Languages in datasets
GPT-3 is also capable of generating and successfully working with languages other than English. The table below shows the top 10 languages in the dataset by number of documents.
Documents for 10 languages in the GPT-3 dataset

| Language   | Number of documents |
| ---------- | ------------------- |
| English    | 235,987,420 |
| German     | 3,014,597 |
| French     | 2,568,341 |
| Portuguese | 1,608,428 |
| Italian    | 1,456,350 |
| Spanish    | 1,284,045 |
| Dutch      | 934,788 |
| Polish     | 632,959 |
| Japanese   | 619,582 |
| Danish     | 396,477 |
The gap between English and the other languages is dramatic: English is number one, with 93% of the dataset, while German, at number two, accounts for just 1%. Even so, that 1% is sufficient to produce near-perfect text in German, including style transfer and other tasks. The same goes for the other languages on the list.
Since GPT-3 is pre-trained on an extensive and diverse corpus of text, it can successfully perform a surprising number of NLP tasks without users providing any additional example data.
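For instance, at launch GPT-3 could be given a task as a plain instruction, with no example data at all. Here is a minimal sketch using the original openai Python package's Completion endpoint as it existed in the GPT-3 era; the library and model names have since evolved, so treat the exact calls as assumptions.

```python
import openai  # the GPT-3-era client library (pre-1.0 interface)

openai.api_key = "YOUR_API_KEY"  # placeholder; never hard-code real keys

# A zero-shot prompt: an instruction only, with no example translations provided.
response = openai.Completion.create(
    engine="davinci",  # a GPT-3 base model
    prompt="Translate English to French:\ncheese =>",
    max_tokens=10,
    temperature=0,
)
print(response["choices"][0]["text"])
```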
Transformer models
Neural networks are at the heart of deep learning; their name and structure are inspired by the human brain. They are composed of a network, or circuit, of neurons that work together. Advances in neural networks can enhance the performance of AI models on various tasks, leading AI scientists to continually develop new architectures for these networks. One such advancement is the transformer, a machine learning model that processes a sequence of text all at once rather than one word at a time and has a strong ability to understand the relationships between those words. This invention has dramatically impacted the field of natural language processing. The following sections look at the transformer-based Seq2Seq architecture and the attention mechanism at its core.
Sequence-to-sequence models
Researchers at Google and the University of Toronto introduced the transformer model in their 2017 paper "Attention Is All You Need":
"We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train."
The foundation of transformer models is sequence-to-sequence architecture. Sequence-to-sequence (Seq2Seq) models are useful for converting a sequence of elements, such as words in a sentence, into another sequence, such as a sentence in a different language. This is particularly effective in translation tasks, where a sequence of words in one language is translated into a sequence of words in another language. Google Translate started using a Seq2Seq-based model in 2016.
Seq2Seq models consist of two components: an Encoder and a Decoder. The Encoder can be thought of as a translator who speaks French as their first language and Korean as their second language. The Decoder is a translator who speaks English as their first language and Korean as their second language. To translate French to English, the Encoder converts the French sentence into Korean (also known as the context) and passes it on to the Decoder. Since the Decoder understands Korean, it can translate the sentence from Korean into English. Working together in this way, the Encoder and Decoder can translate from French to English.
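Below is a minimal sketch of this encoder-decoder split using PyTorch's built-in nn.Transformer. The vocabulary sizes, dimensions, and token IDs are made up for illustration, and positional encodings are omitted for brevity.

```python
import torch
from torch import nn

SRC_VOCAB, TGT_VOCAB, D_MODEL = 1000, 1000, 64   # made-up sizes

src_embed = nn.Embedding(SRC_VOCAB, D_MODEL)     # turns source tokens into vectors
tgt_embed = nn.Embedding(TGT_VOCAB, D_MODEL)     # turns target tokens into vectors
transformer = nn.Transformer(d_model=D_MODEL, nhead=4,
                             num_encoder_layers=2, num_decoder_layers=2,
                             batch_first=True)
to_vocab = nn.Linear(D_MODEL, TGT_VOCAB)         # scores for the next target token

src = torch.randint(0, SRC_VOCAB, (1, 7))        # a 7-token "French" sentence
tgt = torch.randint(0, TGT_VOCAB, (1, 5))        # the "English" tokens produced so far

# The encoder reads the whole source sentence into a context ("Korean" in the
# analogy); the decoder reads that context plus the target tokens so far.
memory = transformer.encoder(src_embed(src))
out = transformer.decoder(tgt_embed(tgt), memory)
next_token_scores = to_vocab(out)                # shape: (1, 5, TGT_VOCAB)
print(next_token_scores.shape)
```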
Transformer attention mechanism
The transformer architecture was invented to improve AI’s performance on machine translation tasks. “Transformers started as language models,” AI researcher Yannic Kilcher explains. “Not even that large, but then they became large.”
To use transformer models effectively, it is crucial to grasp the concept of attention. Attention mechanisms mimic how the human brain focuses on specific parts of an input sequence, using probabilities to determine which parts of the sequence are most relevant at each step.
For example, look at the sentence, “The cat sat on the mat once it ate the mouse.” Does “it” in this sentence refer to “the cat” or “the mat”? The transformer model can strongly connect “it” with “the cat.” That’s attention.
To extend the translator analogy, the Encoder writes down keywords that are important to the meaning of the sentence and provides them to the Decoder along with the translation. These keywords make the Decoder’s job easier because it now has a better understanding of the critical parts of the sentence and the terms that provide context.
Types of attention
The transformer model has two types of attention: self-attention (the connections between words within a sentence) and Encoder-Decoder attention (the connections between words in the source sentence and words in the target sentence).
The attention mechanism helps the transformer filter out the noise and focus on what’s relevant: connecting two words that are in a semantic relationship with each other even though they carry no apparent markers pointing to one another.
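To make the attention computation concrete, here is a minimal sketch of scaled dot-product self-attention in PyTorch. The dimensions and the (random) projection matrices are illustrative assumptions; real models learn these projections during training.

```python
import math
import torch

d_model = 8
tokens = torch.randn(11, d_model)        # one vector per word of a short sentence

W_q = torch.randn(d_model, d_model)      # query, key, and value projections
W_k = torch.randn(d_model, d_model)      # (random here, learned in practice)
W_v = torch.randn(d_model, d_model)

Q, K, V = tokens @ W_q, tokens @ W_k, tokens @ W_v

# Every word scores every other word; softmax turns the scores into probabilities.
scores = Q @ K.T / math.sqrt(d_model)
weights = torch.softmax(scores, dim=-1)  # row i: how much word i attends to each word

# Each word's new representation is a weighted mix of all the value vectors,
# which is how a word like "it" can pull in information from "the cat".
attended = weights @ V
print(weights[7])                        # attention distribution for the 8th word
```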
Transformer models benefit from larger architectures and larger quantities of data. Training on large datasets and fine-tuning for specific tasks improve results. Transformers capture the context of words in a sentence better than earlier kinds of neural networks. GPT uses only the Decoder part of the transformer architecture.
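GPT-3 itself is available only through OpenAI’s API, but its openly released predecessor GPT-2 shares the same decoder-only design. Here is a hedged sketch of generating text with GPT-2 through the Hugging Face transformers library, purely to illustrate how a decoder-only model continues a prompt.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The cat sat on the mat once it"
inputs = tokenizer(prompt, return_tensors="pt")

# A decoder-only model attends only to the tokens it has already seen
# (causal self-attention) and predicts the next token, one step at a time.
output_ids = model.generate(**inputs, max_new_tokens=20, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```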
Test your understanding
For each sentence below, decide which word the instance of “it” most relates to. Choose from the following options: Car, Street, Man, Municipal corporation, Goose.

1. “The man did not cross the street because it was too full.”
2. “The car needs fuel if it wants to go that far.”
3. “The street is closed by the municipal corporation due to the security incidents it is facing.”
4. “The goose did not go to the street because it was too scared.”