...


Understanding the Self-Attention Mechanism

Let's go through a step-by-step explanation of the self-attention mechanism.

How can we create the query, key, and value matrices? To create these, we introduce three new weight matrices called $W^Q$, $W^K$, and $W^V$. We create the query, $Q$, key, $K$, and value, $V$, matrices by multiplying the input matrix $X$ by $W^Q$, $W^K$, and $W^V$, respectively.

Note: The weight matrices $W^Q$, $W^K$, and $W^V$ are randomly initialized, and their optimal values will be learned during training. As we learn the optimal weights, we will obtain more accurate query, key, and value matrices.

As shown in the following figure, by multiplying the input matrix, $X$, by the weight matrices $W^Q$, $W^K$, and $W^V$, we obtain the query, key, and value matrices:

Figure: Creating query, key, and value matrices

From the preceding figure, we can understand the following:

  • The first row of the query, key, and value matrices ($q_1$, $k_1$, and $v_1$) holds the query, key, and value vectors of the word 'I'.

  • The second row ($q_2$, $k_2$, and $v_2$) holds the query, key, and value vectors of the word 'am'.

  • The third row ($q_3$, $k_3$, and $v_3$) holds the query, key, and value vectors of the word 'good'.
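
To make this concrete, here is a minimal NumPy sketch of the computation. The 512-dimensional embedding size is an assumption for illustration; this excerpt only fixes the query, key, and value dimensionality at 64:

```python
import numpy as np

np.random.seed(0)

# Toy sentence from the text: one embedding row per word.
sentence = ["I", "am", "good"]

# Assumed 512-dim embeddings; the excerpt fixes only d_k = 64.
embed_dim, d_k = 512, 64
X = np.random.randn(len(sentence), embed_dim)   # input matrix, shape (3, 512)

# Randomly initialized weight matrices; in a real model their
# optimal values are learned during training.
W_Q = np.random.randn(embed_dim, d_k)           # shape (512, 64)
W_K = np.random.randn(embed_dim, d_k)
W_V = np.random.randn(embed_dim, d_k)

# Query, key, and value matrices: one row per word.
Q = X @ W_Q                                     # shape (3, 64)
K = X @ W_K
V = X @ W_V

q1, k1, v1 = Q[0], K[0], V[0]                   # vectors for the word 'I'
print(Q.shape, K.shape, V.shape)                # (3, 64) (3, 64) (3, 64)
```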

Dimensions of the query, key, and value matrices

Note that the dimensionality of the query, key, and value vectors is 64. Thus, the dimension of our query, key, and value matrices is $[\text{sentence length} \times 64]$.

Since we have three words in the sentence, the dimensions of the query, key, and value matrices are $[3 \times 64]$.
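
As a quick shape check, here is the matrix multiplication spelled out. The 512-dimensional embedding size is again an assumption for illustration:

$$
\underbrace{X}_{3 \times 512} \; \underbrace{W^Q}_{512 \times 64} \;=\; \underbrace{Q}_{3 \times 64}
$$

and likewise $K = XW^K$ and $V = XW^V$ both have dimension $[3 \times 64]$.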

But still, the ultimate question is: why are we computing these matrices at all? What is the use of the query, key, and value matrices? How are they going to help us? This is exactly what we will discuss in detail in the next section.

How the self-attention mechanism works

We learned how to compute the query, $Q$, key, $K$, and value, $V$, matrices, and we also learned that they are obtained from the input matrix, $X$. Now, let's see how the query, key, and value matrices are used in the self-attention mechanism.

We learned that in order to compute a representation of a word, the self-attention mechanism relates the word to all the words in the given sentence. Consider the sentence 'I am good'. To compute the representation of the word 'I', we relate the word 'I' to all the words in the sentence, as shown in the following figure:

Figure: Self-attention example
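
As a hedged preview of the computation discussed next, one way to quantify how 'I' relates to each word is the dot product between its query vector and every word's key vector:

```python
import numpy as np

np.random.seed(0)
sentence = ["I", "am", "good"]
X = np.random.randn(3, 512)                 # toy embeddings (512-dim assumed)
W_Q = np.random.randn(512, 64)
W_K = np.random.randn(512, 64)
Q, K = X @ W_Q, X @ W_K                     # query and key matrices, (3, 64)

# Relate 'I' to every word: dot product of its query vector with
# each word's key vector. These raw scores are unnormalized; the
# scaling and softmax steps are covered in what follows.
scores_for_I = Q[0] @ K.T                   # shape (3,)
for word, score in zip(sentence, scores_for_I):
    print(f"score('I', '{word}') = {score:.2f}")
```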

But why do we need to do this? Understanding how a word is related to all the words in the sentence helps us learn a better representation. Now, let's learn how the self-attention mechanism relates a word to all the words in the sentence using the query, key, and value matrices. The self-attention mechanism ...