Encoder-Decoder Attention
Uncover encoder-decoder attention and autoregressive decoding in transformers for neural machine translation, emphasizing self-attention's pivotal role.
So far, we've discussed the self-attention mechanism. To fully understand the solution presented in the Transformer architecture, we also need to examine the decoder stack and how it attends to the encoder's output.
Understanding decoder components
Each decoder layer is similar to an encoder layer, incorporating a self-attention component, but it adds an encoder-decoder attention component. This encoder-decoder attention combines the encoder stack's outputs with the decoder's current output.
Masked decoder self-attention
The encoder-decoder attention layer works much like multiheaded self-attention. However, it builds its query matrix from the layer directly below it while taking its key and value matrices from the encoder stack's output.
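A minimal sketch of this idea, using PyTorch's nn.MultiheadAttention with made-up batch and sequence sizes: the queries come from the decoder's previous sublayer, while the keys and values come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8

# Hypothetical tensors: batch of 2 sentences, source length 10, target length 7
encoder_output = torch.randn(2, 10, d_model)   # keys and values come from here
decoder_hidden = torch.randn(2, 7, d_model)    # queries come from the sublayer below

cross_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Query = decoder state, Key = Value = encoder output
context, attn_weights = cross_attention(
    query=decoder_hidden,
    key=encoder_output,
    value=encoder_output,
)

print(context.shape)       # torch.Size([2, 7, 512]): one context vector per target position
print(attn_weights.shape)  # torch.Size([2, 7, 10]): each target position attends over source positions
```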
In the encoder stack, the process begins with input embedding, where the input sequence is embedded into meaningful representations. This is followed by encoder stacking: multiple layers progressively refine the input representation. At each layer, the encoder produces key and value matrices that encapsulate the information learned up to that point.
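A sketch of this pipeline, assuming PyTorch's built-in nn.TransformerEncoder and arbitrary vocabulary and model sizes (positional encoding is omitted here for brevity):

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_heads, num_layers = 10_000, 512, 8, 6

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

src_tokens = torch.randint(0, vocab_size, (2, 10))  # batch of 2 source sentences, 10 tokens each
src_embedded = embedding(src_tokens)                # (2, 10, 512): input embedding
memory = encoder(src_embedded)                      # (2, 10, 512): refined through 6 stacked layers

# `memory` is what the decoder's encoder-decoder attention later projects into keys and values
print(memory.shape)
```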
In the decoder stack, the decoder begins with input embedding and positional encoding, embedding the target sequence while preserving sequence order with positional information. The decoder stacking process is analogous to the encoder's, consisting of multiple layers that refine the decoding process. Notably, masked self-attention is employed so that each position attends only to preceding positions, preventing information leakage from the future.
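The masking itself can be implemented as an upper-triangular boolean mask that blocks attention to future positions. A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

d_model, num_heads, tgt_len = 512, 8, 5
decoder_input = torch.randn(2, tgt_len, d_model)

# True above the diagonal marks positions that must NOT be attended to (the "future")
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

self_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
out, weights = self_attention(decoder_input, decoder_input, decoder_input,
                              attn_mask=causal_mask)

# Each row of the attention-weight matrix has non-zero entries only up to its own position
print(weights[0].round(decimals=2))
```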
The final output of the stacked decoders yields a refined representation of the target sequence. This output leverages intrinsic knowledge from the decoder itself and contextual information from the encoder stack, showcasing the collaborative interplay between the encoder and decoder in capturing dependencies and relationships within the input and target sequences.
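Putting these pieces together, PyTorch's nn.TransformerDecoder wires masked self-attention and encoder-decoder attention into each layer. A sketch with assumed shapes, where the memory tensor stands in for the encoder stack's output:

```python
import torch
import torch.nn as nn

d_model, num_heads, num_layers, tgt_len = 512, 8, 6, 7

decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

tgt_embedded = torch.randn(2, tgt_len, d_model)   # embedded and positionally encoded target
memory = torch.randn(2, 10, d_model)              # output of the encoder stack

causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# Each layer applies masked self-attention over `tgt`, then attends to `memory`
output = decoder(tgt=tgt_embedded, memory=memory, tgt_mask=causal_mask)
print(output.shape)  # torch.Size([2, 7, 512])
```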
Autoencoding vs. autoregressive decoding
The sequence-to-sequence model with an attention mechanism that we described previously consists of an encoder and a decoder. The encoder is built entirely from self-attention and multihead attention, as we discussed, with no recurrent connections.
Self-attention can be applied across multiple layers, with each encoder layer containing its own self-attention sublayer. Consequently, the encoder's token representations are refined at each self-attention layer using multihead attention. Starting from input data