The Last Mile: Decoding Results

Learn about transformers, which are composed of encoders and decoders, with a focus on the decoder's unique role in decoding the input back into its original format.


As mentioned earlier, transformers are made of two components:

  • Encoder

  • Decoder

Even though the two share the core elements of positional encoding, self-attention, and feedforward layers, the decoder has to perform one additional operation: decoding its output back into the original data format. This is done by a linear layer (a feedforward network that maps the dimension of the decoder's output to the dimension of the output vocabulary) and a softmax function (which transforms that result into a vector of probabilities).

From that vector, we pick the word corresponding to the highest probability and use it as the best output of the model.
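
To make this concrete, here is a minimal sketch in PyTorch of that final step: a linear projection onto the vocabulary, a softmax to get probabilities, and a greedy pick of the most likely token. The dimensions (`d_model = 512`, `vocab_size = 50_000`) are illustrative assumptions, not GPT-3's actual values.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, chosen only for illustration.
d_model = 512        # size of the decoder's hidden representation
vocab_size = 50_000  # number of tokens the model can output

# Linear layer: maps the hidden representation to one score per vocabulary token.
to_vocab = nn.Linear(d_model, vocab_size)

# Pretend this is the decoder's output for the last position of a sequence.
decoder_output = torch.randn(1, d_model)

logits = to_vocab(decoder_output)      # shape: (1, vocab_size)
probs = torch.softmax(logits, dim=-1)  # vector of probabilities summing to 1

# Pick the token with the highest probability (greedy decoding).
next_token_id = torch.argmax(probs, dim=-1)
print(next_token_id.item())
```

Picking the single highest-probability token is known as greedy decoding; in practice, generation may instead sample from the probability vector or use strategies such as beam search.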

GPT-3

Now, we will discuss GPT-3, the architecture behind ChatGPT. It is a model based on the transformer architecture, yet with a peculiarity: it uses only the decoder stack. OpenAI researchers introduced this decoder-only approach in their paper “Improving Language Understanding by Generative Pre-Training.”
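
To give an intuition for what "decoder-only" means, here is a rough sketch (not GPT-3's actual implementation) of the causal, or masked, self-attention that characterizes a decoder block: each position can attend only to itself and earlier positions, which is what lets the model generate text left to right. The function name, shapes, and weight matrices below are illustrative assumptions.

```python
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Toy single-head masked self-attention; x has shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / k.shape[-1] ** 0.5  # scaled attention scores
    seq_len = x.shape[0]
    # Causal mask: position i cannot look at positions j > i.
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Illustrative shapes only: 4 tokens, 8-dimensional embeddings.
x = torch.randn(4, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([4, 8])
```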

GPT-3 is huge. But how huge, concretely?

Let’s start with the knowledge base it was trained on. It was meant to be as exhaustive as possible in terms of human knowledge, so it was composed of different sources:

  • Common Crawl: This is a massive corpus of web data gathered over an eight-year period with minimal filtering.

  • OpenWebText2: This is a collection of text from web pages linked to Reddit posts with at least three upvotes.

  • Books1 and Books2: These are two separate corpora consisting of books available on the internet.

  • Wikipedia: This is a corpus containing articles from the English-language version of the popular online encyclopedia, Wikipedia.

