Encoder-Decoder Attention
Uncover encoder-decoder attention and autoregressive decoding in transformers for neural machine translation, emphasizing self-attention's pivotal role.
So far, we've discussed the self-attention mechanism. To fully understand the solution presented in the Transformer architecture, we also need to examine the decoder stack and how it attends to the encoder's output.
Understanding decoder components
Each decoder layer is similar to an encoder layer, incorporating a self-attention component, but it adds an encoder-decoder attention component. This encoder-decoder attention combines the encoder stack's outputs with the decoder's current output.
Masked decoder self-attention
The encoder-decoder attention layer works much like multiheaded self-attention. However, it builds its query matrix from the layer directly below it while taking its key and value matrices from the encoder stack's output.
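A minimal sketch of this idea, using PyTorch's nn.MultiheadAttention with made-up batch and sequence sizes: the queries come from the decoder's previous sublayer, while the keys and values come from the encoder output.

```python
import torch
import torch.nn as nn

d_model, num_heads = 512, 8

# Hypothetical tensors: batch of 2 sentences, source length 10, target length 7
encoder_output = torch.randn(2, 10, d_model)   # keys and values come from here
decoder_hidden = torch.randn(2, 7, d_model)    # queries come from the sublayer below

cross_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)

# Query = decoder state, Key = Value = encoder output
context, attn_weights = cross_attention(
    query=decoder_hidden,
    key=encoder_output,
    value=encoder_output,
)

print(context.shape)       # torch.Size([2, 7, 512]): one context vector per target position
print(attn_weights.shape)  # torch.Size([2, 7, 10]): each target position attends over source positions
```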
In the encoder stack, the process begins with input embedding, where the input sequence is embedded into meaningful representations. This is followed by encoder stacking: multiple layers progressively refine the input representation. At each layer, the encoder produces key and value matrices that encapsulate the information learned up to that point.
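A sketch of this pipeline, assuming PyTorch's built-in nn.TransformerEncoder and arbitrary vocabulary and model sizes (positional encoding is omitted here for brevity):

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_heads, num_layers = 10_000, 512, 8, 6

embedding = nn.Embedding(vocab_size, d_model)
encoder_layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

src_tokens = torch.randint(0, vocab_size, (2, 10))  # batch of 2 source sentences, 10 tokens each
src_embedded = embedding(src_tokens)                # (2, 10, 512): input embedding
memory = encoder(src_embedded)                      # (2, 10, 512): refined through 6 stacked layers

# `memory` is what the decoder's encoder-decoder attention later projects into keys and values
print(memory.shape)
```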
In the decoder stack, the decoder begins with input embedding and positional encoding, embedding the target sequence while preserving sequence order with positional information. The decoder stacking process is analogous to the encoder's, consisting of multiple layers that refine the decoding process. Notably, masked self-attention is employed so that each position attends only to preceding positions, preventing information leakage from the future.
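The masking itself can be implemented as an upper-triangular boolean mask that blocks attention to future positions. A minimal sketch with assumed dimensions:

```python
import torch
import torch.nn as nn

d_model, num_heads, tgt_len = 512, 8, 5
decoder_input = torch.randn(2, tgt_len, d_model)

# True above the diagonal marks positions that must NOT be attended to (the "future")
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

self_attention = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
out, weights = self_attention(decoder_input, decoder_input, decoder_input,
                              attn_mask=causal_mask)

# Each row of the attention-weight matrix has non-zero entries only up to its own position
print(weights[0].round(decimals=2))
```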
The final output of the stacked decoders yields a refined representation of the target sequence. This output leverages intrinsic knowledge from the decoder itself and contextual information from the encoder stack, showcasing the collaborative interplay between the encoder and decoder in capturing dependencies and relationships within the input and target sequences.
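Putting these pieces together, PyTorch's nn.TransformerDecoder wires masked self-attention and encoder-decoder attention into each layer. A sketch with assumed shapes, where the memory tensor stands in for the encoder stack's output:

```python
import torch
import torch.nn as nn

d_model, num_heads, num_layers, tgt_len = 512, 8, 6, 7

decoder_layer = nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

tgt_embedded = torch.randn(2, tgt_len, d_model)   # embedded and positionally encoded target
memory = torch.randn(2, 10, d_model)              # output of the encoder stack

causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# Each layer applies masked self-attention over `tgt`, then attends to `memory`
output = decoder(tgt=tgt_embedded, memory=memory, tgt_mask=causal_mask)
print(output.shape)  # torch.Size([2, 7, 512])
```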
Autoencoding vs. autoregressive decoding
The sequence-to-sequence model with an attention mechanism that we described previously consists of an encoder and a decoder. The encoder is built entirely from self-attention and multihead attention, as we discussed, with no recurrent connections.
Self-attention can be applied across multiple layers, with each encoder layer containing its own self-attention sublayer. Consequently, the encoder's token representations are refined at each self-attention layer using multihead attention. Starting from input data