Masked Multi-Head Attention
Learn about the masked multi-head attention mechanism and how it works.
In our English-to-French translation task, say our training dataset looks like the one shown here:
A sample training set
| Source sentence | Target sentence |
| --- | --- |
| I am good | Je vais bien |
| Good morning | Bonjour |
| Thank you very much | Merci beaucoup |
From the preceding dataset, we can see that we have pairs of source and target sentences. We also saw how the decoder predicts the target sentence word by word at each time step, and that this happens only during testing.
During training, since we have the right target sentence, we can just feed the whole target sentence as input to the decoder, but with a small modification. We learned that the decoder takes the <sos> token as its first input and, at every time step, combines the word predicted so far with its input to predict the target sentence until the <eos> token is reached. So, instead of feeding the input word by word, we can feed the whole target sentence, prepended with the <sos> token, as input to the decoder.
Say we are converting the English sentence 'I am good' to the French sentence 'Je vais bien'. We can just add the <sos> token to the beginning of the target sentence and feed '<sos> Je vais bien' as input to the decoder; the decoder is then expected to predict the shifted target sentence 'Je vais bien <eos>' as output.
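To make the shifted-input idea concrete, here is a minimal Python sketch of how the decoder's input and expected output could be formed for this sentence pair. The <sos>/<eos> token names and the whitespace tokenization are illustrative assumptions, not a fixed API.

```python
# Form the decoder's training input and expected output for one sentence pair.
source_sentence = "I am good"
target_sentence = "Je vais bien"

target_tokens = target_sentence.split()

# Decoder input: the target sentence prepended with <sos>.
decoder_input = ["<sos>"] + target_tokens      # ['<sos>', 'Je', 'vais', 'bien']

# Expected decoder output: the target sentence shifted right, ending in <eos>.
expected_output = target_tokens + ["<eos>"]    # ['Je', 'vais', 'bien', '<eos>']

print(decoder_input)
print(expected_output)
```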
But how does this work? Isn't this kind of ambiguous? Why do we need to feed the entire target sentence and let the decoder predict the shifted target sentence as output? Let's explore this in more detail.
We learned that instead of feeding the input directly to the decoder, we convert it into an embedding (the output embedding matrix), add the positional encoding, and then feed it to the decoder. Let's suppose X is the matrix obtained by adding the output embedding matrix and the positional encoding.
Now, we feed this matrix X as input to the decoder, whose first sublayer is the masked multi-head attention layer.
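As a rough sketch of how X could be built, the snippet below looks up the output (target) embeddings and adds a standard sinusoidal positional encoding. The vocabulary, the d_model value, and the randomly initialized embedding table are placeholders for illustration, not part of the original text.

```python
import numpy as np

d_model = 512
vocab = {"<sos>": 0, "Je": 1, "vais": 2, "bien": 3}          # toy vocabulary
embedding_table = np.random.randn(len(vocab), d_model)        # output embedding matrix

decoder_input = ["<sos>", "Je", "vais", "bien"]
token_ids = [vocab[token] for token in decoder_input]
embeddings = embedding_table[token_ids]                       # shape: (4, d_model)

# Sinusoidal positional encoding: sine on even dimensions, cosine on odd ones.
positions = np.arange(len(decoder_input))[:, None]            # (4, 1)
dims = np.arange(d_model)[None, :]                            # (1, d_model)
angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
angles = positions * angle_rates
pos_encoding = np.where(dims % 2 == 0, np.sin(angles), np.cos(angles))

X = embeddings + pos_encoding                                 # matrix fed to the decoder
print(X.shape)                                                # (4, 512)
```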
Computing query, key, and value matrices
To perform self-attention, we first create three new matrices, called the query (Q), key (K), and value (V) matrices, by multiplying the input matrix X by the weight matrices W^Q, W^K, and W^V.
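A minimal sketch of this computation is shown below. The weight matrices here are randomly initialized placeholders and the dimensions (d_model = 512, d_k = 64) are illustrative assumptions; in a real model the weights are learned parameters.

```python
import numpy as np

seq_len, d_model, d_k = 4, 512, 64

# Stand-in for the decoder input matrix X from the previous step.
X = np.random.randn(seq_len, d_model)

# Weight matrices for the query, key, and value projections.
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)

Q = X @ W_q    # query matrix, shape (4, 64)
K = X @ W_k    # key matrix,   shape (4, 64)
V = X @ W_v    # value matrix, shape (4, 64)

print(Q.shape, K.shape, V.shape)
```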