What is the intuition behind the dot product attention?

The attention mechanism is a machine learning and deep learning technique that allows models to selectively focus on specific parts of the input data while performing a task. It assigns varying degrees of importance to different parts of the input. Attention mechanisms are vital to large language models (LLMs) such as GPT (Generative Pre-trained Transformer) models.

Types of attention mechanisms

There are three main types of attention mechanisms:

  • Self-attention: This is an attention mechanism that is used within a sequence. It is also called an intra-attention or internal attention mechanism. It captures dependencies and relationships between tokens, enabling models to understand contextual information.

  • Encoder-decoder attention: This mechanism is used between sequences, where the encoder processes the input sequence and the decoder generates the output sequence. It is also known as an inter-attention or external attention mechanism. Its strength lies in helping to align the source and target sequences while facilitating information transfer.

  • Multi-head attention: This is a more complex mechanism in comparison. It projects the input into several learned subspaces and performs multiple attention operations in parallel, one per head.

This Answer will focus on the self-attention mechanism, using scaled dot product attention as its variant. Scaled dot product attention is a form of dot product attention that scales the dot product scores, keeping their magnitudes in check and stabilizing training.

Example

Let's start with an example. Our input sentence is "I like to go to the park and play —." There are several plausible options for what the person might play in a park, such as cricket, football, or basketball. However, by looking at the context of the sentence, we know that the missing word cannot be book, house, or sleep. This is the process we want our model to learn: given the sequence, the model should understand the relationships between its words. These relationships, or context similarities, can be represented as matrices and visualized as heatmaps.

Architecture

The following diagram shows the methodology that scaled dot product attention uses to determine which words suit our sequence given the context. In simpler terms, the purpose of this mechanism is to decide which words of a sentence the transformer (a groundbreaking neural network architecture) should focus on.

Structure of a scaled dot product attention mechanism

The process starts with the input data, which is the introductory sentence in our example. The sentence is split into tokens of individual words, which are then processed through preprocessing layers, such as the embedding layer. After that, we obtain our data $x_i$ for $i \in [1, \ldots, T]$, where $T$ is the number of tokens. From this, we create three new vectors.

  • Query vector ($Q$): It is derived from the current position that the attention mechanism is focused on. The queries are computed as $q_i = x_i W_q$ for $i \in [1, \ldots, T]$, where $W_q$ is a learned weight matrix. Stacking the rows gives $Q \in \mathbb{R}^{T \times d_k}$.

  • Key vector ($K$): It is derived from all the positions in the input sequence. The keys are computed as $k_i = x_i W_k$ for $i \in [1, \ldots, T]$, where $W_k$ is a learned weight matrix. Stacking the rows gives $K \in \mathbb{R}^{T \times d_k}$.

  • Value vector ($V$): It contains the information associated with each position in the input sequence. The values are computed as $v_i = x_i W_v$ for $i \in [1, \ldots, T]$, where $W_v$ is a learned weight matrix. Stacking the rows gives $V \in \mathbb{R}^{T \times d_v}$.

Note: The dimensions of $Q$ and $K$ must both be $d_k$ to allow the matrix multiplication in the next step. However, $V$ can have a different dimension $d_v$. In our example, we will keep all three the same for simplicity.
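As a rough illustration of these projections, here is a minimal NumPy sketch. The sizes, the random embedding matrix `X`, and the randomly initialized weight matrices are toy stand-ins for what a real model would learn:

```python
import numpy as np

T, d_model, d_k, d_v = 4, 8, 8, 8     # toy sizes; in our example d_k = d_v
rng = np.random.default_rng(0)

X = rng.normal(size=(T, d_model))     # token embeddings x_1, ..., x_T

# Randomly initialized stand-ins for the learned weight matrices W_q, W_k, W_v
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_v))

Q = X @ W_q   # queries, shape (T, d_k)
K = X @ W_k   # keys,    shape (T, d_k)
V = X @ W_v   # values,  shape (T, d_v)
```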

First layer

In the first step, we calculate the dot product of $Q$ and $K$. However, we cannot do this directly because their dimensions do not line up for matrix multiplication, so we transpose $K$ and proceed. This product gives us a $T \times T$ matrix of similarity scores that we denote as $QK^T$.

Matrix multiplication of Q vector and transpose of K vector
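A minimal NumPy sketch of this step, using small random matrices as stand-ins for $Q$ and $K$:

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))   # (T, d_k) queries
K = rng.normal(size=(4, 8))   # (T, d_k) keys

scores = Q @ K.T              # (T, T) matrix of raw attention scores
print(scores.shape)           # (4, 4)
```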

Second layer

Second, we scale our data. When the dimension $d_k$ is large, the variance of the entries of $QK^T$ grows with it, which in the next step would push the softmax into regions with vanishingly small gradients. To prevent this, we apply a scale factor of $1/\sqrt{d_k}$ to obtain $QK^T/\sqrt{d_k}$.

Matrix scaling using a factor of the square root of d_k
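A small sketch of the scaling step, assuming a hypothetical score matrix standing in for $QK^T$:

```python
import numpy as np

d_k = 8                                    # key dimension
scores = np.array([[8.0, 2.0, 1.0, 0.5],
                   [2.0, 9.0, 0.5, 1.0],
                   [1.0, 0.5, 7.0, 2.0],
                   [0.5, 1.0, 2.0, 6.0]])  # stand-in for Q @ K.T

scaled = scores / np.sqrt(d_k)             # keeps the softmax out of its saturated region
```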

Third layer

Third, we will create an attention matrix named $A$. To do this, we pass our scaled score matrix into a softmax function. The softmax function is applied to each row: it maps the row to a vector of probabilities between 0 and 1 whose elements sum to 1.

Applying the softmax activation function to the processed matrix
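A minimal sketch of the row-wise softmax, with a hand-picked matrix standing in for the scaled scores:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

scaled = np.array([[2.8, 0.7, 0.4, 0.2],
                   [0.7, 3.2, 0.2, 0.4],
                   [0.4, 0.2, 2.5, 0.7],
                   [0.2, 0.4, 0.7, 2.1]])     # stand-in for QK^T / sqrt(d_k)

A = softmax(scaled, axis=-1)                  # attention matrix
print(A.sum(axis=-1))                         # each row sums to 1: [1. 1. 1. 1.]
```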

Fourth layer

Finally, we use matrix multiplication once more, this time multiplying our attention matrix $A$ with the value matrix $V$. This produces a final matrix $R \in \mathbb{R}^{T \times d_v}$ representing the attended information, or context, based on the attention weights. It can then be used as input for other tasks or another model to build upon.

Matrix multiplication of the attention matrix and vector V
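A small sketch of this final multiplication, using a uniform attention matrix and random values as stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.full((4, 4), 0.25)     # stand-in attention matrix (each row sums to 1)
V = rng.normal(size=(4, 8))   # (T, d_v) value vectors

R = A @ V                     # (T, d_v): each row is a weighted mix of the value vectors
print(R.shape)                # (4, 8)
```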

Conclusion

With that, we have the final equation of our attention mechanism: $\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$.
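Putting the four layers together, here is a minimal, self-contained NumPy sketch of the whole mechanism. The function name and the toy inputs are illustrative, not part of the original Answer:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for single-head attention."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (T, T) scaled scores
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # row-wise softmax -> attention matrix
    return A @ V                                  # (T, d_v) attended output

# Toy usage
rng = np.random.default_rng(0)
T, d_k, d_v = 4, 8, 8
Q, K = rng.normal(size=(T, d_k)), rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```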

To sum it up, dot product attention is a powerful tool for modeling dependencies and capturing relevant information in sequences. It measures the similarity and importance of elements, which allows models to selectively focus on crucial inputs and make informed predictions and decisions. This enhances the performance of machine learning and deep learning models by dynamically attending to relevant information and making context-aware decisions.
