...


Multi-Head Attention

Learn about the inner workings of the multi-head attention component of the decoder.

The following figure shows the transformer model with both the encoder and decoder. As we can observe, the multi-head attention sublayer in each decoder receives two inputs: one is from the previous sublayer, masked multi-head attention, and the other is the encoder representation:

Figure: Encoder-decoder interaction

Let's represent the encoder representation with R and the matrix obtained as a result of the masked multi-head attention sublayer with M. Since here we have an interaction between the encoder and the decoder, this layer is also called an encoder-decoder attention layer.

How the multi-head attention layer works

Now, let's look into the details and learn how exactly this multi-head attention layer works. The first step in the multi-head attention mechanism is creating the query, key, and value matrices. We learned that we can create the query, key, and value matrices by multiplying the input matrix by the weight matrices. But in this layer, we have two input matrices: one is R (the encoder representation) and the other is M (the attention matrix from the previous sublayer). So, which one should we use?

Computing query, key, and value matrices

We create the query matrix, Q, using the attention matrix, M, obtained from the previous sublayer, and we create the key and value matrices using the encoder representation, R. Since we are performing the multi-head attention mechanism, for head i, we do the following:

  • The query matrix, Q_i, is created by multiplying the attention matrix, M, by the weight matrix, W_i^Q.

  • The key and value matrices are created by multiplying the encoder representation, R, by the weight matrices, W_i^K and W_i^V, respectively. This is shown in the following figure: ...
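
To make these matrix operations concrete, here is a minimal NumPy sketch of how a single head i of the encoder-decoder attention sublayer could compute its query, key, and value matrices and then apply scaled dot-product attention. The dimensions, sequence lengths, and randomly initialized weight matrices are illustrative assumptions only; in a real transformer, the weights are learned and the inputs come from the actual encoder and decoder sublayers.

import numpy as np

d_model, d_k = 512, 64            # model and per-head dimensions (assumed)
seq_len_enc, seq_len_dec = 6, 4   # encoder/decoder sequence lengths (assumed)

rng = np.random.default_rng(0)
R = rng.standard_normal((seq_len_enc, d_model))  # encoder representation R
M = rng.standard_normal((seq_len_dec, d_model))  # masked multi-head attention output M

# Per-head weight matrices W_i^Q, W_i^K, W_i^V (learned in a real model)
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q_i = M @ W_Q   # query comes from the decoder side (M)
K_i = R @ W_K   # key comes from the encoder representation (R)
V_i = R @ W_V   # value comes from the encoder representation (R)

# Scaled dot-product attention for head i
scores = Q_i @ K_i.T / np.sqrt(d_k)                 # shape: (seq_len_dec, seq_len_enc)
scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax over encoder positions
Z_i = weights @ V_i                                 # shape: (seq_len_dec, d_k)
print(Z_i.shape)

Note how the shapes work out: because Q_i is built from M, each decoder position produces one row of attention weights over the encoder positions, which is exactly what lets the decoder attend to the encoder representation.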