The Transformer's Decoder

Formulate the transformer's decoder and learn about masked multi-head self-attention.

The decoder consists of all the aforementioned components plus two novel ones. As before:

  1. The output sequence is fed in its entirety, and word embeddings are computed.

  2. Positional encoding is again applied.

  3. The vectors are passed to the first decoder block.
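As a quick recap, steps 1 and 2 can be sketched in a few lines of PyTorch. The model dimension d_model = 512 and the scaling by the square root of d_model follow the original paper, but the variable names and the dummy input sequence are assumptions for illustration only:

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 32000, 5000  # assumed sizes for illustration

# Step 1: token ids -> word embeddings
embedding = nn.Embedding(vocab_size, d_model)

# Step 2: fixed sinusoidal positional encodings, added to the embeddings
position = torch.arange(max_len).unsqueeze(1)                          # (max_len, 1)
div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)

tokens = torch.randint(0, vocab_size, (1, 10))                         # a dummy output sequence
x = embedding(tokens) * math.sqrt(d_model) + pe[: tokens.size(1)]      # ready for the first decoder block
```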

Each decoder block includes:

  1. A masked multi-head self-attention layer

  2. A normalization layer followed by a residual connection

  3. A new multi-head attention layer (known as encoder-decoder attention)

  4. A second normalization layer and a residual connection

  5. A linear layer and a third residual connection

The decoder block is repeated N = 6 times. The output of the last block is passed through a final linear layer, and the output probabilities are calculated with the standard softmax function.
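To make this structure concrete, here is a minimal PyTorch sketch of one decoder block and the stack of six, built on the library's nn.MultiheadAttention and nn.LayerNorm modules. The class names (DecoderBlock, Decoder) and the hyperparameters (d_model = 512, 8 heads, feed-forward size 2048) are illustrative assumptions, not the original implementation:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.masked_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x, enc_out, causal_mask):
        # 1-2: masked self-attention, then residual connection + layer normalization
        a, _ = self.masked_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + a)
        # 3-4: encoder-decoder attention (queries from the decoder, keys/values from the encoder)
        a, _ = self.cross_attn(x, enc_out, enc_out)
        x = self.norm2(x + a)
        # 5: position-wise feed-forward (linear) layer with another residual connection
        return self.norm3(x + self.ffn(x))

class Decoder(nn.Module):  # also illustrative
    def __init__(self, d_model=512, vocab_size=32000, N=6):
        super().__init__()
        self.blocks = nn.ModuleList([DecoderBlock(d_model) for _ in range(N)])
        self.out_proj = nn.Linear(d_model, vocab_size)  # final linear layer

    def forward(self, x, enc_out, causal_mask):
        for block in self.blocks:
            x = block(x, enc_out, causal_mask)
        return torch.softmax(self.out_proj(x), dim=-1)  # output probabilities
```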

The output probabilities predict the next token in the output sentence.

How?

For the machine translation example, we assign a probability to every word in the French vocabulary and then simply keep the one with the highest score.
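In other words, this is greedy decoding: an argmax over the probability vector at the last position. A tiny sketch with dummy probabilities (the shapes and names are assumed, not part of the original model):

```python
import torch

# probs: output of the decoder, shape (batch, seq_len, vocab_size)
probs = torch.rand(1, 3, 32000).softmax(dim=-1)   # dummy probabilities for illustration
next_token = probs[:, -1, :].argmax(dim=-1)       # greedy choice: keep the word with the highest score
```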

To put things into perspective, the original model was trained on the WMT 2014 English-French dataset, which consists of 36M sentence pairs and uses a vocabulary of roughly 32,000 word-piece tokens.

While you are familiar with most concepts of the decoder, there are two more that we need to discuss. Let’s start with the masked multi-head self-attention layer.

Masked multi-head self-attention

In case you haven’t noticed, in the decoding stage we predict one word (token) at a time. In NLP problems such as machine translation, this sequential token prediction is unavoidable. As a result, the self-attention layer needs to be modified so that it considers only the part of the output sentence that has been generated so far.

In our translation example “Hello I love you”, the input to the decoder on the third pass will be “Bonjour”, “je … …”.

As you can tell, the difference here is that we don’t know the whole sentence because it hasn’t been produced yet. That is why we need to disregard the unknown words. Otherwise, the model would just copy the next word! To achieve this, we mask the future positions by setting their attention scores to −∞ before the softmax, so the model cannot peek at words it hasn’t generated yet.
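In practice, this is usually done with an upper-triangular “causal” mask of −∞ values added to the attention scores before the softmax. A minimal sketch (the sequence length and dummy scores are assumptions for illustration):

```python
import torch

seq_len = 4  # length of the target sequence seen so far

# Causal mask: row i has -inf in every column j > i and 0 elsewhere,
# so position i can only attend to positions 0..i.
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)

scores = torch.rand(seq_len, seq_len)           # dummy attention scores for illustration
weights = torch.softmax(scores + mask, dim=-1)  # future positions receive attention weight exactly 0
```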