Multi-Head Attention
Learn about the inner workings of the multi-head attention component of the decoder.
The following figure shows the transformer model with both the encoder and decoder. As we can observe, the multi-head attention sublayer in each decoder receives two inputs: one is from the previous sublayer, masked multi-head attention, and the other is the encoder representation:
Let's represent the encoder representation with $R$ and the attention matrix obtained as the output of the masked multi-head attention sublayer with $M$.
How the multi-head attention layer works
Now, let's look into the details and learn how exactly this multi-head attention layer works. The first step in the multi-head attention mechanism is creating the query, key, and value matrices. We learned that we can create the query, key, and value matrices by multiplying the input matrix by the weight matrices. But in this layer, we have two input matrices: one is the encoder representation, $R$, and the other is the attention matrix, $M$, obtained from the previous (masked multi-head attention) sublayer. So which input should be used to create which matrix?
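As a quick refresher on that projection step, here is a minimal NumPy sketch of how query, key, and value matrices are obtained from a single input matrix, as in the encoder's self-attention. The dimensions (`seq_len`, `d_model`, `d_k`) and the random weights are illustrative assumptions, not values from the text; in a trained model the weight matrices are learned parameters.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the text)
seq_len, d_model, d_k = 4, 512, 64

rng = np.random.default_rng(0)

# A single input matrix X, as in the encoder's self-attention
X = rng.standard_normal((seq_len, d_model))

# Weight matrices (randomly initialized here purely for illustration)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Query, key, and value matrices are created by multiplying the input
# matrix by the corresponding weight matrix
Q = X @ W_Q   # shape: (seq_len, d_k)
K = X @ W_K   # shape: (seq_len, d_k)
V = X @ W_V   # shape: (seq_len, d_k)
```

In the decoder's multi-head attention sublayer, the difference is that the projections are split across the two inputs, as described next.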
Computing query, key, and value matrices
We create the query matrix, $Q$, using the attention matrix, $M$, obtained from the previous sublayer, and we create the key and value matrices, $K$ and $V$, using the encoder representation, $R$.

The query matrix, $Q$, is created by multiplying the attention matrix, $M$, by the weight matrix, $W^Q$. The key and value matrices, $K$ and $V$, are created by multiplying the encoder representation, $R$, by the weight matrices, $W^K$ and $W^V$, respectively. This is shown in the following figure: ...
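Putting these pieces together, the following is a minimal single-head sketch of this encoder-decoder attention step. The target and source lengths, model dimensions, and random weights are made-up assumptions for illustration; a real model would use trained parameters and repeat this computation once per head before concatenating the results.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes (assumptions): 3 target positions, 5 source positions
tgt_len, src_len, d_model, d_k = 3, 5, 512, 64
rng = np.random.default_rng(0)

M = rng.standard_normal((tgt_len, d_model))  # output of masked multi-head attention
R = rng.standard_normal((src_len, d_model))  # encoder representation

W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# Query comes from the decoder side (M); key and value come from the encoder side (R)
Q = M @ W_Q   # (tgt_len, d_k)
K = R @ W_K   # (src_len, d_k)
V = R @ W_V   # (src_len, d_k)

# Scaled dot-product attention: each target position attends over the encoder representation
scores = Q @ K.T / np.sqrt(d_k)   # (tgt_len, src_len)
Z = softmax(scores, axis=-1) @ V  # (tgt_len, d_k)
```

Because the queries come from the decoder while the keys and values come from the encoder, each decoder position can attend over the entire input sequence, which is why this sublayer is also referred to as encoder-decoder attention.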