...


Multi-Head Attention

Learn about the inner workings of the multi-head attention component of the decoder.

The following figure shows the transformer model with both the encoder and decoder. As we can observe, the multi-head attention sublayer in each decoder receives two inputs: one is from the previous sublayer, masked multi-head attention, and the other is the encoder representation:

Figure: Encoder-decoder interaction

Let's represent the encoder representation with R and the matrix obtained as a result of the masked multi-head attention sublayer with M. Since here we have an interaction between the encoder and the decoder, this layer is also called an encoder-decoder attention layer.

How the multi-head attention layer works

Now, let's look into the details and learn how exactly this multi-head attention layer works. The first step in the multi-head attention mechanism is creating the query, key, and value matrices. We learned that we can create the query, key, and value matrices by multiplying the input matrix by the weight matrices. But in this layer, we have two input matrices: one is R (the encoder representation) and the other is M (the attention matrix from the previous sublayer). So, which one should we use?

Computing query, key, and value matrices

We create the query matrix, Q, using the attention matrix, M, obtained from the previous sublayer, and we create the key and value matrices using the encoder representation, R. Since we are performing the multi-head attention mechanism, for head i, we do the following:

  • The query matrix, Q_i, is created by multiplying the attention matrix, M, by the weight matrix, W_i^Q.

  • The key and value matrices are created by multiplying the encoder representation, R, by the weight matrices, W_i^K and W_i^V, respectively. This is shown in the following figure: ...
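
To make these matrix operations concrete, here is a minimal NumPy sketch of how a single head i of the encoder-decoder attention sublayer could compute its query, key, and value matrices and then apply scaled dot-product attention. The dimensions, sequence lengths, and randomly initialized weight matrices are illustrative assumptions only; in a real transformer, the weights are learned and the inputs come from the actual encoder and decoder sublayers.

import numpy as np

d_model, d_k = 512, 64            # model and per-head dimensions (assumed)
seq_len_enc, seq_len_dec = 6, 4   # encoder/decoder sequence lengths (assumed)

rng = np.random.default_rng(0)
R = rng.standard_normal((seq_len_enc, d_model))  # encoder representation R
M = rng.standard_normal((seq_len_dec, d_model))  # masked multi-head attention output M

# Per-head weight matrices W_i^Q, W_i^K, W_i^V (learned in a real model)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q_i = M @ W_Q   # query comes from the decoder side (M)
K_i = R @ W_K   # key comes from the encoder representation (R)
V_i = R @ W_V   # value comes from the encoder representation (R)

# Scaled dot-product attention for head i
scores = Q_i @ K_i.T / np.sqrt(d_k)                 # shape: (seq_len_dec, seq_len_enc)
scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over encoder positions
Z_i = weights @ V_i                                 # shape: (seq_len_dec, d_k)
print(Z_i.shape)

Note how the shapes work out: because Q_i is built from M, each decoder position produces one row of attention weights over the encoder positions, which is exactly what lets the decoder attend to the encoder representation.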