What is the intuition behind the dot product attention?

An attention mechanism is a deep learning technique that lets a model focus selectively on specific parts of the input data while performing a task. It assigns different weights to different parts of the input. Attention mechanisms are central to large language models (LLMs) such as the GPT (Generative Pre-trained Transformer) models.

Types of attention mechanisms

There are three main types of attention mechanisms:

Self-attention: This is an attention mechanism that is used within a sequence. It is also called an intra-attention or internal attention mechanism. It captures dependencies and relationships between tokens, enabling models to understand contextual information and relationships.
Encoder-decoder attention: This mechanism is used between sequences where the encoder processes the input sequence, and the decoder generates the output sequence. It is also known as an inter-attention or external attention mechanism. Its strength lies in helping to align the source and target sequence while facilitating information transfer.
Multi-head attention: This mechanism runs several attention operations (heads) in parallel. Each head has its own learned parameters and attends to a different representation subspace, and the outputs of the heads are then combined.

This Answer focuses on the self-attention mechanism, using scaled dot product attention as the scoring function. Scaled dot product attention is a variant of dot product attention: it divides each dot product by $\sqrt{d_k}$ , where $d_k$ is the dimension of the key vectors. Without this scaling, large dot products push the softmax into regions with very small gradients, which makes training less stable.
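To make the effect of the scaling concrete, here is a minimal sketch in plain Python. It assumes query and key entries are independent with zero mean and unit variance, which is an illustrative assumption rather than a property of any trained model. The helper names `dot` and `sample_dot_products` are made up for this example.

```python
import math
import random
import statistics

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def sample_dot_products(d_k, n_samples, rng):
    """Dot products of random vector pairs with unit-variance entries (assumption)."""
    products = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        products.append(dot(q, k))
    return products

rng = random.Random(0)  # fixed seed so the run is repeatable
d_k = 64                # hypothetical key dimension

raw = sample_dot_products(d_k, 2000, rng)
scaled = [p / math.sqrt(d_k) for p in raw]  # the scaling step

raw_var = statistics.pvariance(raw)
scaled_var = statistics.pvariance(scaled)
print(f"variance of raw dot products:    {raw_var:.1f}")
print(f"variance of scaled dot products: {scaled_var:.2f}")
```

Under these assumptions, the variance of the raw dot products grows roughly in proportion to $d_k$ , while the scaled values stay near a variance of 1 regardless of the dimension.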

Example

Let's start with an example. Our input sentence is "I like to go to the park and play ____." Several words could plausibly fill the blank, such as cricket, football, or basketball. However, the context of the sentence tells us that the blank cannot be book, house, or sleep. This is the behavior we want our model to learn: an LLM should capture the relationships between words within the sequence. These relationships, or context similarities, can be stored in matrices and visualized as heatmaps.
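The idea of context similarity can be illustrated with a toy sketch. The three-dimensional vectors below are hand-picked, hypothetical values chosen only so that sports words point in a similar direction to "play"; they do not come from any real model.

```python
# Hand-made toy embeddings (hypothetical values, for illustration only).
toy_embeddings = {
    "play":     [0.90, 0.10, 0.00],
    "cricket":  [0.80, 0.20, 0.10],
    "football": [0.85, 0.15, 0.00],
    "book":     [0.10, 0.90, 0.20],
    "sleep":    [0.00, 0.20, 0.90],
}

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Score every candidate word against the context word "play".
context = toy_embeddings["play"]
scores = {
    word: dot(context, vec)
    for word, vec in toy_embeddings.items()
    if word != "play"
}

# Rank candidates from most to least similar to the context.
ranked = sorted(scores, key=scores.get, reverse=True)
for word in ranked:
    print(f"{word:>8}: {scores[word]:.3f}")
```

With these toy values, the sports words score higher against "play" than "book" or "sleep" do. A row of such scores is what one row of an attention heatmap visualizes.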

Architecture

The following diagram shows how scaled dot product attention determines which words suit our sequence given the context. In simpler terms, the purpose of this mechanism is to decide which words of a sentence the transformer should focus on.

The process starts with the input data, which is the example sentence above. The sentence is split into tokens, here individual words. The tokens are then passed through preprocessing layers, such as the embedding layer. This gives us the inputs $x_i$ for $i \in [1, ..., T]$ , where $T$ is the number of tokens. From each $x_i$ , we create three new vectors.

Query vector ( $Q$ ): It is derived from the current position that the attention mechanism is focused on. It is computed as $q_i = x_i W_q$ for $i \in [1, ..., T]$ , where $W_q$ is a learned query weight matrix ( $q_i \in \mathbb{R}^{d_k}$ ).
Key vector ( $K$ ): It is derived from all the positions in the input sequence. It is computed as $k_i = x_i W_k$ for $i \in [1, ..., T]$ , where $W_k$ is a learned key weight matrix ( $k_i \in \mathbb{R}^{d_k}$ ).
Value vector ( $V$ ): It contains information associated with each position in the input sequence. It is computed as $v_i = x_i W_v$ for $i \in [1, ..., T]$ , where $W_v$ is a learned value weight matrix ( $v_i \in \mathbb{R}^{d_v}$ ).

Note: The dimensions of vectors $Q$ and $K$ must be the same so that their dot product is defined. However, vector $V$ can have a different dimension. In our example, we will keep all three the same for simplicity.
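The three projections can be sketched as follows. This is a minimal illustration in plain Python, not a reference implementation: the sizes, the name `d_model` for the embedding width, and the helpers `matmul` and `random_matrix` are assumptions made for this example, and the weight matrices are random stand-ins for values that a real model would learn.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]

def random_matrix(rows, cols, rng):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)

# Hypothetical sizes: T tokens, embedding width d_model, and d_k == d_v
# to match the simplification in the note above.
T, d_model, d_k, d_v = 4, 8, 3, 3

x = random_matrix(T, d_model, rng)       # stand-in for the embedded tokens x_i
W_q = random_matrix(d_model, d_k, rng)   # random here; learned in a real model
W_k = random_matrix(d_model, d_k, rng)
W_v = random_matrix(d_model, d_v, rng)

# Row i of each result is q_i = x_i W_q, k_i = x_i W_k, v_i = x_i W_v.
Q = matmul(x, W_q)
K = matmul(x, W_k)
V = matmul(x, W_v)

print("Q:", len(Q), "x", len(Q[0]))
print("K:", len(K), "x", len(K[0]))
print("V:", len(V), "x", len(V[0]))
```

Stacking the per-token vectors as rows gives one matrix each for the queries, keys, and values, so all positions can be processed with a single matrix product.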

First layer

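In the first step, we calculate the dot product of $Q$ and $K$ . When the query and key vectors are stacked as the rows of matrices $Q$ and $K$ (each of size $T \times d_k$ ), we cannot multiply them directly because the inner dimensions do not match. To fix this, we transpose $K$ and compute $QK^T$ . For a single query and key pair, this gives a scalar score. For the whole sequence, it gives a $T \times T$ matrix whose entry $(i, j)$ is the dot product of $q_i$ and $k_j$ .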
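This first step can be sketched in a few lines of plain Python. The small $Q$ and $K$ matrices below are made-up values with $T = 3$ and $d_k = 2$ , and the helpers `transpose` and `matmul` are written out here only for illustration.

```python
def transpose(m):
    """Swap the rows and columns of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a_ik * b_kj for a_ik, b_kj in zip(row, col)) for col in zip(*b)]
        for row in a
    ]

# Made-up example with T = 3 tokens and d_k = 2.
Q = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]
K = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]

# Q is 3 x 2 and K is 3 x 2, so Q times K is undefined.
# Transposing K gives a 2 x 3 matrix, and Q K^T is 3 x 3.
scores = matmul(Q, transpose(K))

for row in scores:
    print(row)
```

Row $i$ of `scores` holds the dot products of query $q_i$ with every key, so each row describes how strongly one token relates to all tokens in the sequence.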