Multi-Head Attention Mechanism
Learn about the need for multiple attention matrices and how to compute them.
Instead of having a single attention head, we can use multiple attention heads. We have already learned how to compute the attention matrix:

$$Z = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$
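For reference, here is a minimal NumPy sketch of that single-head computation; the function name `scaled_dot_product_attention`, the toy dimensions, and the random inputs are illustrative choices, not something fixed by the lesson text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single attention head: Z = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity scores, shape (seq_len, seq_len)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted sum of the value vectors

# Toy example: 3 tokens, per-head dimension 4.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
Z = scaled_dot_product_attention(Q, K, V)
print(Z.shape)  # (3, 4): one attention (context) vector per token
```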
Let's understand this with an example. Consider the phrase 'All is well'.
Say we need to compute the self-attention of the word 'well'. After computing the similarity scores, suppose we end up with the following weighted sum:

$$z_{\text{well}} = 0.6\, v_{\text{All}} + 0.0\, v_{\text{is}} + 0.4\, v_{\text{well}}$$
As we can observe from the preceding equation, the self-attention value of the word 'well' is the sum of the value vectors weighted by the scores. Looking closely, the attention value of the actual word 'well' is dominated by the other word 'All': since we multiply the value vector of the word 'All' by 0.6 and the value vector of the actual word 'well' by only 0.4, the resulting representation of 'well' carries more information about 'All' than about 'well' itself.
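To make this dominance concrete, here is a tiny check with made-up two-dimensional value vectors; only the weights 0.6 and 0.4 come from the example above, everything else is an illustrative assumption.

```python
import numpy as np

# Hypothetical value vectors, chosen only for illustration.
v_all  = np.array([1.0, 0.0])
v_is   = np.array([0.0, 0.0])
v_well = np.array([0.0, 1.0])

z_well = 0.6 * v_all + 0.0 * v_is + 0.4 * v_well
print(z_well)                            # [0.6 0.4]
print(np.linalg.norm(z_well - v_all))    # ~0.57: close to v_all
print(np.linalg.norm(z_well - v_well))   # ~0.85: farther from v_well
```

The weighted sum lands much closer to the value vector of 'All' than to that of 'well', which is exactly the dominance described above.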
This dominance is useful only in circumstances where the meaning of the actual word is ambiguous. For instance, consider the following sentence: 'A dog ate the food because it was hungry.'
Say we are computing the self-attention for the word 'it'. After computing the similarity scores, suppose the score for 'dog' dominates completely:

$$z_{\text{it}} = 1.0\, v_{\text{dog}}$$
As we can observe from the preceding equation, the attention value of the word 'it' is just the value vector of the word 'dog'; the representation of the actual word 'it' is completely dominated by the word 'dog'. This is fine in this case, because the meaning of the word 'it' is ambiguous: it may refer to either 'dog' or 'food'.
Thus, when the actual word is ambiguous, as in the preceding example, letting the value vectors of other words dominate is useful. When the word is not ambiguous, however, this dominance makes it harder to capture the right meaning of the word.
How to compute multi-head attention matrices
So, to make our results more accurate, instead of computing a single attention matrix we compute multiple attention matrices (one per attention head) and then concatenate their results. With several heads, no single set of attention scores has to carry the full meaning of a word on its own, so the combined attention matrix is more accurate. Let's explore this in more detail.
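In symbols, following the standard transformer formulation (the per-head projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ and the output matrix $W_0$ are the usual learned weights; they are not defined in the text above), each head $i$ computes its own attention matrix, and the heads are concatenated and projected:

$$Z_i = \text{softmax}\!\left(\frac{Q_i K_i^\top}{\sqrt{d_k}}\right)V_i, \quad \text{with } Q_i = X W_i^Q,\; K_i = X W_i^K,\; V_i = X W_i^V$$

$$\text{MultiHead}(X) = \text{Concat}(Z_1, \ldots, Z_h)\, W_0$$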
Let's suppose we are computing two attention matrices, $Z_1$ and $Z_2$.
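A minimal NumPy sketch of this two-head computation might look as follows; the matrix shapes, the random weight initialization, and the helper function are illustrative assumptions rather than a prescribed implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Z = softmax(Q K^T / sqrt(d_k)) V
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(42)
seq_len, d_model, d_k = 3, 8, 4            # 3 tokens, model size 8, per-head size 4
X = rng.normal(size=(seq_len, d_model))    # input embeddings

heads = []
for i in range(2):                         # two heads: Z1 and Z2
    W_q = rng.normal(size=(d_model, d_k))  # head-specific projection matrices
    W_k = rng.normal(size=(d_model, d_k))
    W_v = rng.normal(size=(d_model, d_k))
    Z_i = attention(X @ W_q, X @ W_k, X @ W_v)
    heads.append(Z_i)

W_0 = rng.normal(size=(2 * d_k, d_model))  # output projection
multi_head = np.concatenate(heads, axis=-1) @ W_0
print(multi_head.shape)                    # (3, 8): one combined vector per token
```

Because each head has its own projection matrices, each head can attend to the sentence differently; concatenating the heads and multiplying by $W_0$ maps the combined result back to the model dimension.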