...

Multi-Head Attention Mechanism

Learn about the need for multiple attention matrices and how to compute them.

Instead of using a single attention head, we can use multiple attention heads. We have already learned how to compute the attention matrix Z. Rather than computing a single attention matrix Z, we can compute several of them. But what is the use of computing multiple attention matrices?

Let's understand this with an example. Consider the phrase:

'All is well'

Say we need to compute the self-attention of the word 'well'. After computing the similarity scores, suppose we have the following:

Figure: Self-attention of the word 'well'

As we can observe from the preceding figure, the self-attention value of the word 'well' is the sum of the value vectors weighted by the scores. Looking closely, the attention value of the actual word 'well' is dominated by the other word 'All': since we multiply the value vector of the word 'All' by 0.6 and the value vector of the actual word 'well' by only 0.4, Z_{well} contains 60% of its values from the value vector of the word 'All' and only 40% from the value vector of the actual word 'well'.
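Written out, the weighted sum the figure describes looks roughly like the following sketch. The weights 0.6 and 0.4 come from the text above; assigning a weight of 0.0 to the remaining word 'is' is an assumption made purely for illustration:

$$Z_{well} = 0.6 \, v_{All} + 0.0 \, v_{is} + 0.4 \, v_{well}$$

Here, v_{All}, v_{is}, and v_{well} denote the value vectors of the respective words.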

Such dominance is useful only in circumstances where the meaning of the actual word is ambiguous. That is, consider the following sentence:

'A dog ate the food because it was hungry'

Say we are computing the self-attention for the word 'it'. After computing the similarity scores, suppose we have the following:

Figure: Self-attention of the word 'it'

As we can observe from the preceding figure, the attention value of the word 'it' is just the value vector of the word 'dog'. Here, the attention value of the actual word 'it' is dominated by the word 'dog'. But this is fine, since the meaning of the word 'it' is ambiguous: it may refer to either 'dog' or 'food'.
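In the same notation as before (again a sketch, assuming the scores on all other words are negligible), this corresponds to:

$$Z_{it} \approx 1.0 \, v_{dog}$$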

Thus, when the actual word is ambiguous, as in the preceding example, letting the value vectors of other words dominate is useful; otherwise, such dominance makes it harder to capture the right meaning of the word.

How to compute multi-head attention matrices

So, to make sure that our results are accurate, instead of computing a single attention matrix, we will compute multiple attention matrices and then concatenate their results. The idea behind multi-head attention is that using several attention heads, rather than one, gives us a more accurate attention matrix. Let's explore this in more detail.
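To make this concrete, here is a minimal NumPy sketch of computing multiple attention matrices and concatenating them. This is a sketch under assumptions: the toy dimensions, the random weight initialization, and names such as multi_head_attention, W_q, W_k, and W_v are illustrative and not part of this lesson; in the standard transformer, the concatenated result is additionally multiplied by an output projection matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, num_heads=2, d_k=4, d_v=4, seed=0):
    """Compute num_heads attention matrices Z_1, ..., Z_h and concatenate them.

    X has shape (seq_len, d_model). The weight matrices are randomly
    initialized here purely for illustration; in a real model they are learned.
    """
    rng = np.random.default_rng(seed)
    d_model = X.shape[1]
    heads = []
    for _ in range(num_heads):
        # Each head has its own query, key, and value weight matrices.
        W_q = rng.normal(size=(d_model, d_k))
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_v))

        Q, K, V = X @ W_q, X @ W_k, X @ W_v

        # Scaled dot-product attention for this head: softmax(Q K^T / sqrt(d_k)) V
        scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
        Z_i = scores @ V
        heads.append(Z_i)

    # Concatenate the per-head attention matrices along the feature dimension.
    return np.concatenate(heads, axis=-1)

# Toy input: a 3-word "sentence" with d_model = 8.
X = np.random.default_rng(42).normal(size=(3, 8))
Z = multi_head_attention(X, num_heads=2)
print(Z.shape)  # (3, 8): 3 words, 2 heads x d_v = 4 features each
```

Because each head has its own query, key, and value weight matrices, each head can attend to the words of the sentence in a different way, which is exactly why multiple attention matrices help.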

Let's suppose we are computing two attention matrices, Z_1 and Z_2. First, let's compute the attention matrix Z_1.

Computing attention matrix Z1

...