Understanding the Self-Attention Mechanism
Let's go through a step-by-step explanation of the self-attention mechanism.
How can we create the query, key, and value matrices? To create these, we introduce three new weight matrices, called W^Q, W^K, and W^V. We create the query (Q), key (K), and value (V) matrices by multiplying the input matrix X by W^Q, W^K, and W^V, respectively.
Note: The weight matrices W^Q, W^K, and W^V are randomly initialized, and their optimal values are learned during training. As we learn the optimal weights, we obtain more accurate query, key, and value matrices.
As shown in the following figure, multiplying the input matrix X by the weight matrices W^Q, W^K, and W^V gives us the query matrix Q, the key matrix K, and the value matrix V.
From the preceding figure, we can understand the following:
- The first row in the query, key, and value matrices (q1, k1, and v1) represents the query, key, and value vectors of the word 'I'.
- The second row in the query, key, and value matrices (q2, k2, and v2) represents the query, key, and value vectors of the word 'am'.
- The third row in the query, key, and value matrices (q3, k3, and v3) represents the query, key, and value vectors of the word 'good'.
Dimensions of the query, key, and value matrices
Note that the dimensionality of the query, key, and value vectors is 64. Thus, each of the weight matrices W^Q, W^K, and W^V has dimensions [512 x 64], since the embedding (input) dimension is 512. Since we have three words in the sentence, the dimensions of the query, key, and value matrices Q, K, and V are [3 x 64].
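The computation above can be sketched with NumPy. The sizes follow the text (three words, embedding dimension 512, query/key/value dimension 64); the random values merely stand in for a learned embedding and learned weights:

```python
import numpy as np

np.random.seed(0)

seq_len, d_model, d_k = 3, 512, 64        # "I am good": 3 words

X = np.random.randn(seq_len, d_model)     # input (embedding) matrix

# Weight matrices, randomly initialized; in a real model they are learned.
W_Q = np.random.randn(d_model, d_k)
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q = X @ W_Q   # query matrix: row i is the query vector of word i
K = X @ W_K   # key matrix:   row i is the key vector of word i
V = X @ W_V   # value matrix: row i is the value vector of word i

print(Q.shape, K.shape, V.shape)   # (3, 64) (3, 64) (3, 64)
```

Note that the first row of Q is exactly the embedding of 'I' multiplied by W^Q, matching the row-per-word correspondence described above.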
But still, the ultimate question is, why are we computing this? What is the use of query, key, and value matrices? How is this going to help us? This is exactly what we will discuss in detail in the next section.
How the self-attention mechanism works
We learned how to compute the query matrix Q, the key matrix K, and the value matrix V, and we saw that they are obtained from the input matrix X. Now, let's see how they are used in the self-attention mechanism.
We learned that in order to compute a representation of a word, the self-attention mechanism relates the word to all the words in the given sentence. Consider the sentence 'I am good'. To compute the representation of the word 'I', we relate the word 'I' to all the words in the sentence, as shown in the following figure:
But why do we need to do this? Understanding how a word is related to all the words in the sentence helps us learn a better representation. Now, let's learn how the self-attention mechanism relates a word to all the words in the sentence using the query, key, and value matrices. The self-attention mechanism ...
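As a rough preview, relating a word to every word in the sentence amounts to taking dot products between its query vector and all the key vectors, turning those scores into weights, and using the weights to combine the value vectors. The sketch below is my own minimal illustration with toy sizes, not the chapter's step-by-step derivation:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    # Illustrative sketch of (scaled dot-product) self-attention.
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # word-to-word relatedness scores
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # new representation of each word

np.random.seed(1)
X = np.random.randn(3, 8)                  # "I am good": 3 words, toy embedding dim 8
W_Q, W_K, W_V = (np.random.randn(8, 4) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                             # (3, 4): one new representation per word
```

Each row of the result is a weighted mixture of the value vectors of all three words, which is exactly the "relate each word to every word" idea described above.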