Transformer Architecture: Encoder, Decoder, and Computing Output
Learn about the encoder and decoder in the transformer architecture.
Transformer architecture
A transformer is a type of seq2seq model. Transformer models can work with both image and text data. The transformer model takes in a sequence of inputs and maps it to a sequence of outputs.
The transformer model was initially proposed in the paper “Attention Is All You Need” by Vaswani et al. (2017).
Let’s see how the transformer model works using the previously studied MT task. The encoder takes in a sequence of source language tokens and produces a sequence of interim outputs. The decoder then takes in a sequence of target language tokens and predicts the next token for each time step (using the teacher forcing technique). Both the encoder and the decoder use attention mechanisms to improve performance. For example, the decoder uses attention to inspect all of the encoder’s outputs as well as the previous decoder inputs. The attention mechanism is conceptually similar to Bahdanau attention.
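To make this flow concrete, here is a minimal sketch using PyTorch’s nn.Transformer module (the framework and the toy sizes are assumptions; the lesson itself doesn’t prescribe either). The source tokens are encoded, and the decoder receives the shifted target tokens (teacher forcing) together with the encoder’s outputs to predict the next token at every time step:

```python
import torch
import torch.nn as nn

# Toy sizes; purely illustrative.
src_vocab, tgt_vocab, d_model = 1000, 1000, 128

src_embed = nn.Embedding(src_vocab, d_model)
tgt_embed = nn.Embedding(tgt_vocab, d_model)
transformer = nn.Transformer(d_model=d_model, nhead=8,
                             num_encoder_layers=2, num_decoder_layers=2)
to_logits = nn.Linear(d_model, tgt_vocab)

# nn.Transformer's default layout is (sequence_length, batch_size).
src = torch.randint(0, src_vocab, (10, 32))    # source-language token ids
tgt_in = torch.randint(0, tgt_vocab, (9, 32))  # target tokens shifted right (teacher forcing)

# Causal mask: each decoder position may only look at earlier target positions.
tgt_mask = transformer.generate_square_subsequent_mask(tgt_in.size(0))

# Positional encodings are omitted here to keep the sketch short.
dec_out = transformer(src_embed(src), tgt_embed(tgt_in), tgt_mask=tgt_mask)
logits = to_logits(dec_out)  # (9, 32, tgt_vocab): a next-token prediction per time step
```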
The encoder and the decoder
Now let’s discuss in detail what the encoder and the decoder consist of. They have more or less the same architecture, with a few differences. Both the encoder and the decoder are designed to consume an entire sequence of input items at once. However, their goals during the task differ: the encoder produces a latent representation from its inputs, whereas the decoder produces a target output from its inputs and the encoder’s outputs. To perform these computations, the inputs are propagated through several stacked layers. Each layer takes in a sequence of elements and outputs another sequence of elements. Each layer is also made up of several sublayers that encapsulate the different computations performed on a sequence of input tokens to produce a sequence of outputs.
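As a rough illustration of this stacking (again assuming PyTorch; the layer counts and sizes are arbitrary), the encoder and the decoder can each be built as a stack of identical layers, where every layer maps a sequence of vectors to another sequence of the same shape, and the decoder additionally consumes the encoder’s output:

```python
import torch
import torch.nn as nn

d_model = 128  # illustrative size

# Each layer takes a sequence of vectors and returns a sequence of the same shape.
enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8)
dec_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8)

encoder = nn.TransformerEncoder(enc_layer, num_layers=6)  # stack of identical layers
decoder = nn.TransformerDecoder(dec_layer, num_layers=6)

src = torch.randn(10, 32, d_model)  # (source length, batch, d_model) embeddings
tgt = torch.randn(9, 32, d_model)   # (target length, batch, d_model) embeddings

memory = encoder(src)           # latent representation of the source sequence
out = decoder(tgt, memory)      # decoder uses its inputs *and* the encoder's outputs
print(memory.shape, out.shape)  # torch.Size([10, 32, 128]) torch.Size([9, 32, 128])
```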
Sublayers of the transformer
A layer found in the transformer mainly comprises the following two sublayers (a minimal sketch follows the list):
A self-attention layer
A fully connected layer
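A heavily simplified sketch of such a layer, assuming PyTorch and deliberately leaving out pieces a real transformer layer also contains (such as residual connections and layer normalization), might look like this:

```python
import torch
import torch.nn as nn

class SimplifiedTransformerLayer(nn.Module):
    """Illustrative layer: a self-attention sublayer followed by a fully connected sublayer.
    (Residual connections and layer normalization are omitted for clarity.)"""
    def __init__(self, d_model=128, nhead=8, d_ff=512):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, nhead)
        self.fully_connected = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):  # x: (seq_len, batch, d_model)
        # Each position attends to every other position in the same sequence.
        attn_out, _ = self.self_attn(x, x, x)
        # The fully connected sublayer is applied to each position independently.
        return self.fully_connected(attn_out)

layer = SimplifiedTransformerLayer()
x = torch.randn(10, 32, 128)
print(layer(x).shape)  # torch.Size([10, 32, 128])
```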
The self-attention layer produces its output using matrix multiplications and activation functions (in this respect, it is similar to a fully connected layer). The self-attention layer takes in a sequence of inputs and produces a sequence of outputs. However, a special characteristic of the self-attention layer is that, when producing the output at each time step, it has access to all the other inputs in the sequence. This makes it much easier for this layer to learn and remember dependencies across long sequences of inputs. For comparison, RNNs struggle to remember long sequences because they need to process each input sequentially. Additionally, by design, the self-attention layer can select and combine different inputs at each time step based on the task it’s solving. This makes transformers very powerful in sequential learning tasks.
Let’s discuss why it’s important to selectively combine different input elements this way. In an NLP context, the self-attention layer enables the model to peek at other words while processing a certain word. This means that while the encoder is processing the word “it” in the sentence “I kicked the ball and it disappeared,” the model can attend to the word “ball.” By doing this, the transformer can learn dependencies and disambiguate words, which leads to better language understanding.
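The following toy example (plain Python/NumPy, with randomly initialized weights rather than anything learned) shows the mechanics behind this kind of “peeking”: every word’s output is a weighted combination of all the words in the sentence, so nothing stops the position holding “it” from drawing on “ball.” In a trained model, that particular weight would be learned to be high:

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["I", "kicked", "the", "ball", "and", "it", "disappeared"]

d = 8                                 # tiny embedding size for illustration
X = rng.normal(size=(len(words), d))  # stand-in word embeddings (random here)

# Randomly initialized projection matrices; a real model learns these.
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v

scores = Q @ K.T / np.sqrt(d)  # compare every word with every other word
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax per row

outputs = weights @ V  # each word's output mixes the value vectors of all words

it_row = weights[words.index("it")]
for w, a in zip(words, it_row):
    print(f"{w:>11s}: {a:.2f}")  # attention "it" pays to each word (random weights here)
```

Each row of the weight matrix sums to one, and the row for “it” assigns some weight to every word in the sentence, including “ball,” which is exactly what lets a trained model resolve the pronoun.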
We can even understand how self-attention helps us to solve a task conveniently through a real-world example. Assume we’re playing a game with two other people: person A and person B. Person A holds a ...