Understanding the Self-Attention Mechanism
Let's go through a step-by-step explanation of the self-attention mechanism.
How can we create the query, key, and value matrices? To create these, we introduce three new weight matrices, called W^Q, W^K, and W^V. We create the query (Q), key (K), and value (V) matrices by multiplying the input matrix X by W^Q, W^K, and W^V, respectively.
Note: The weight matrices W^Q, W^K, and W^V are randomly initialized, and their optimal values are learned during training. As we learn the optimal weights, we obtain more accurate query, key, and value matrices.
As shown in the following figure, multiplying the input matrix X by the weight matrices W^Q, W^K, and W^V gives us the query matrix Q, the key matrix K, and the value matrix V.
From the preceding figure, we can understand the following:
- The first row in the query, key, and value matrices (q1, k1, and v1) represents the query, key, and value vectors of the word 'I'.
- The second row in the query, key, and value matrices (q2, k2, and v2) represents the query, key, and value vectors of the word 'am'.
- The third row in the query, key, and value matrices (q3, k3, and v3) represents the query, key, and value vectors of the word 'good'.
Dimensions of the query, key, and value matrices
Note that the dimensionality of the query, key, and value vectors is 64. Thus, each of the weight matrices W^Q, W^K, and W^V has dimensions [512 x 64], since the embedding (input) dimension is 512. Since we have three words in the sentence, the dimensions of the query, key, and value matrices Q, K, and V are [3 x 64].
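The computation above can be sketched with NumPy. The sizes follow the text (three words, embedding dimension 512, query/key/value dimension 64); the random values merely stand in for a learned embedding and learned weights:

```python
import numpy as np

np.random.seed(0)

seq_len, d_model, d_k = 3, 512, 64        # "I am good": 3 words

X = np.random.randn(seq_len, d_model)     # input (embedding) matrix

# Weight matrices, randomly initialized; in a real model they are learned.
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # query matrix: row i is the query vector of word i
K = X @ W_K   # key matrix:   row i is the key vector of word i
V = X @ W_V   # value matrix: row i is the value vector of word i

print(Q.shape, K.shape, V.shape)   # (3, 64) (3, 64) (3, 64)
```

Note that the first row of Q is exactly the embedding of 'I' multiplied by W^Q, matching the row-per-word correspondence described above.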
But still, the ultimate question is, why are we computing this? What is the use of query, key, and value matrices? How is this going to help us? This is exactly what we will discuss in detail in the next section.
How the self-attention mechanism works
We learned how to compute the query matrix Q, the key matrix K, and the value matrix V, and we saw that they are obtained from the input matrix X. Now, let's see how they are used in the self-attention mechanism.
We learned that in order to compute a representation of a word, the self-attention mechanism relates the word to all the words in the given sentence. Consider the sentence 'I am good'. To compute the representation of the word 'I', we relate the word 'I' to all the words in the sentence, as shown in the following figure:
But why do we need to do this? Understanding how a word is related to all the words in the sentence helps us learn a better representation. Now, let's learn how the self-attention mechanism relates a word to all the words in the sentence using the query, key, and value matrices. The self-attention mechanism ...
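As a rough preview, relating a word to every word in the sentence amounts to taking dot products between its query vector and all the key vectors, turning those scores into weights, and using the weights to combine the value vectors. The sketch below is my own minimal illustration with toy sizes, not the chapter's step-by-step derivation:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # Illustrative sketch of (scaled dot-product) self-attention.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # word-to-word relatedness scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # new representation of each word

np.random.seed(1)
X = np.random.randn(3, 8)                  # "I am good": 3 words, toy embedding dim 8
W_Q, W_K, W_V = (np.random.randn(8, 4) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                             # (3, 4): one new representation per word
```

Each row of the result is a weighted mixture of the value vectors of all three words, which is exactly the "relate each word to every word" idea described above.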