Transformer Architecture: Encoder, Decoder, and Computing Output
Learn about the encoder and decoder in the transformer architecture.
Transformer architecture
A transformer is a type of seq2seq model. Transformer models can work with both image and text data. The transformer model takes in a sequence of inputs and maps it to a sequence of outputs.
The transformer model was initially proposed in the paper “Attention Is All You Need” by Vaswani et al. (2017).
Let’s see how the transformer model works using the previously studied MT task. The encoder takes in a sequence of source language tokens and produces a sequence of interim outputs. The decoder then takes in a sequence of target language tokens and predicts the next token for each time step (using the teacher forcing technique). Both the encoder and the decoder use attention mechanisms to improve performance. For example, the decoder uses attention to inspect all of the encoder’s outputs as well as the previous decoder inputs. The attention mechanism is conceptually similar to Bahdanau attention.
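To make this flow concrete, here is a minimal sketch using PyTorch’s nn.Transformer module (the framework and the toy sizes are assumptions; the lesson itself doesn’t prescribe either). The source tokens are encoded, and the decoder receives the shifted target tokens (teacher forcing) together with the encoder’s outputs to predict the next token at every time step:

```python
import torch
import torch.nn as nn

# Toy sizes; purely illustrative.
src_vocab, tgt_vocab, d_model = 1000, 1000, 128

src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2)
to_logits = nn.Linear(d_model, tgt_vocab)

# nn.Transformer's default layout is (sequence_length, batch_size).
src = torch.randint(0, src_vocab, (10, 32))    # source-language token ids
tgt_in = torch.randint(0, tgt_vocab, (9, 32))  # target tokens shifted right (teacher forcing)

# Causal mask: each decoder position may only look at earlier target positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt_in.size(0))

# Positional encodings are omitted here to keep the sketch short.
dec_out = transformer(src_embed(src), tgt_embed(tgt_in), tgt_mask=tgt_mask)
logits = to_logits(dec_out)  # (9, 32, tgt_vocab): a next-token prediction per time step
```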
The encoder and the decoder
Now let’s discuss in detail what the encoder and the decoder consist of. They have more or less the same architecture, with a few differences. Both the encoder and the decoder are designed to consume an entire sequence of input items at once. However, their goals during the task differ: the encoder produces a latent representation from its inputs, whereas the decoder produces a target output from its inputs and the encoder’s outputs. To perform these computations, the inputs are propagated through several stacked layers. Each layer takes in a sequence of elements and outputs another sequence of elements. Each layer is also made up of several sublayers that encapsulate the different computations performed on a sequence of input tokens to produce a sequence of outputs.
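As a rough illustration of this stacking (again assuming PyTorch; the layer counts and sizes are arbitrary), the encoder and the decoder can each be built as a stack of identical layers, where every layer maps a sequence of vectors to another sequence of the same shape, and the decoder additionally consumes the encoder’s output:

```python
import torch
import torch.nn as nn

d_model = 128  # illustrative size

# Each layer takes a sequence of vectors and returns a sequence of the same shape.
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)

encoder = nn.TransformerEncoder(enc_layer, num_layers=6)  # stack of identical layers
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

src = torch.randn(10, 32, d_model)  # (source length, batch, d_model) embeddings
tgt = torch.randn(9, 32, d_model)   # (target length, batch, d_model) embeddings

memory = encoder(src)           # latent representation of the source sequence
out = decoder(tgt, memory)      # decoder uses its inputs *and* the encoder's outputs
print(memory.shape, out.shape)  # torch.Size([10, 32, 128]) torch.Size([9, 32, 128])
```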
Sublayers of the transformer
A layer found in the transformer mainly comprises the following two sublayers (a minimal sketch follows the list):
A self-attention layer
A fully connected layer
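A heavily simplified sketch of such a layer, assuming PyTorch and deliberately leaving out pieces a real transformer layer also contains (such as residual connections and layer normalization), might look like this:

```python
import torch
import torch.nn as nn

class SimplifiedTransformerLayer(nn.Module):
    """Illustrative layer: a self-attention sublayer followed by a fully connected sublayer.
    (Residual connections and layer normalization are omitted for clarity.)"""
    def __init__(self, d_model=128, nhead=8, d_ff=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.fully_connected = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (seq_len, batch, d_model)
        # Each position attends to every other position in the same sequence.
        attn_out, _ = self.self_attn(x, x, x)
        # The fully connected sublayer is applied to each position independently.
        return self.fully_connected(attn_out)

layer = SimplifiedTransformerLayer()
x = torch.randn(10, 32, 128)
print(layer(x).shape)  # torch.Size([10, 32, 128])
```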
The self-attention layer produces its output using matrix multiplications and activation functions (in this respect, it is similar to a fully connected layer). The self-attention layer takes in a sequence of inputs and produces a sequence of outputs. However, a special characteristic of the self-attention layer is that, when producing the output at each time step, it has access to all the other inputs in the sequence. This makes it much easier for this layer to learn and remember dependencies across long sequences of inputs. For comparison, RNNs struggle to remember long sequences because they need to process each input sequentially. Additionally, by design, the self-attention layer can select and combine different inputs at each time step based on the task it’s solving. This makes transformers very powerful in sequential learning tasks.
Let’s discuss why it’s important to selectively combine different input elements this way. In an NLP context, the self-attention layer enables the model to peek at other words while processing a certain word. This means that while the encoder is processing the word “it” in the sentence “I kicked the ball and it disappeared,” the model can attend to the word “ball.” By doing this, the transformer can learn dependencies and disambiguate words, which leads to better language understanding.
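The following toy example (plain Python/NumPy, with randomly initialized weights rather than anything learned) shows the mechanics behind this kind of “peeking”: every word’s output is a weighted combination of all the words in the sentence, so nothing stops the position holding “it” from drawing on “ball.” In a trained model, that particular weight would be learned to be high:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["I", "kicked", "the", "ball", "and", "it", "disappeared"]

d = 8                                 # tiny embedding size for illustration
X = rng.normal(size=(len(words), d))  # stand-in word embeddings (random here)

# Randomly initialized projection matrices; a real model learns these.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)  # compare every word with every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

outputs = weights @ V  # each word's output mixes the value vectors of all words

it_row = weights[words.index("it")]
for w, a in zip(words, it_row):
    print(f"{w:>11s}: {a:.2f}")  # attention "it" pays to each word (random weights here)
```

Each row of the weight matrix sums to one, and the row for “it” assigns some weight to every word in the sentence, including “ball,” which is exactly what lets a trained model resolve the pronoun.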
We can even understand how self-attention helps us to solve a task conveniently through a real-world example. Assume we’re playing a game with two other people: person A and person B. Person A holds a ...