GPT Models
Learn about the GPT model and its successors.
OpenAI is an AI research group that has been in the spotlight for quite some time because of newsworthy models such as GPT, GPT-2, and the recently released GPT-3.
Generative pretraining
In this section, we will discuss these architectures and their novel contributions briefly. Toward the end, we'll use a pretrained version of GPT-2 for our text generation task.
GPT
The first model in this series is called GPT, or Generative Pre-trained Transformer. It was released in 2018, around the same time as the BERT model.
The GPT model is essentially a language model based on the transformer-decoder architecture presented in the previous chapter (see the lesson on Transformers). Since a language model can be trained in an unsupervised fashion, the authors used this unsupervised approach to pretrain the model on a very large corpus and then fine-tuned it for specific downstream tasks. The corpus used for this unsupervised pretraining was the BookCorpus dataset.
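For reference, the unsupervised pretraining step maximizes the standard left-to-right language-modeling likelihood (following the notation of the original GPT paper): the probability of each token given the tokens that precede it,

$$
L_1(\mathcal{U}) = \sum_i \log P\left(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta\right)
$$

where $\mathcal{U} = \{u_1, \ldots, u_n\}$ is the token corpus, $k$ is the size of the context window, and $\Theta$ denotes the parameters of the transformer-decoder.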
As shown in part (a) of the figure above, the GPT model closely follows the original transformer decoder. The authors use 12 decoder blocks (as opposed to 6 in the original transformer), each with 768-dimensional states and 12 self-attention heads. Since the model uses masked self-attention, it preserves the causal nature of a language model and can therefore also be used for text generation. For the remaining tasks shown in part (b) of the figure, essentially the same pretrained language model is reused, with minimal task-specific preprocessing of the inputs and task-specific output layers/objectives.
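To make the masked self-attention point concrete, here is a minimal, single-head NumPy sketch (an illustration only, not the actual GPT implementation; the function name and shapes are ours) showing how a causal mask blocks attention to future positions:

```python
import numpy as np

def causal_self_attention(q, k, v):
    """Single-head masked (causal) self-attention sketch.

    q, k, v: arrays of shape (seq_len, d_model). Each position may only
    attend to itself and to earlier positions.
    """
    seq_len, d_model = q.shape
    scores = q @ k.T / np.sqrt(d_model)              # (seq_len, seq_len)

    # Causal mask: positions above the diagonal (the "future") are blocked.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)

    # Row-wise softmax over the keys.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

# Toy check: with 4 positions, the attention weights are zero above the diagonal.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = causal_self_attention(x, x, x)
print(out.shape)  # (4, 8)
```

Because row *i* of the attention weights is zero beyond position *i*, the prediction for the next token depends only on the tokens seen so far, which is exactly the property a left-to-right language model needs for text generation.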
GPT-2
GPT was superseded by an even more powerful model called GPT-2. Radford et al. presented the GPT-2 model as part of their work titled “Language Models are Unsupervised Multitask Learners.”
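As a preview of the text generation exercise mentioned earlier, the following is a hedged sketch of loading a pretrained GPT-2 checkpoint with the Hugging Face `transformers` library (an assumption on our part; the lesson's own code may use a different setup) and sampling a couple of continuations from a prompt:

```python
# Sketch only: assumes the `transformers` package is installed along with a
# backend (PyTorch or TensorFlow); model weights are downloaded on first use.
from transformers import pipeline, set_seed

set_seed(42)  # make the sampled continuations reproducible

# "gpt2" is the publicly released 124M-parameter GPT-2 checkpoint.
generator = pipeline("text-generation", model="gpt2")

prompt = "Deep learning is"
outputs = generator(
    prompt,
    max_length=40,           # total length in tokens, prompt included
    do_sample=True,          # sample instead of greedy decoding
    num_return_sequences=2,  # generate two alternative continuations
)

for i, out in enumerate(outputs):
    print(f"--- sample {i + 1} ---")
    print(out["generated_text"])
```

Because decoding is sampled rather than greedy, each run (with a different seed) produces different continuations of the same prompt.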