An attention mechanism is a machine learning and deep learning technique that allows models to selectively focus on specific parts of the input while performing a task. It assigns varying degrees of importance to different parts of the input data. Attention mechanisms are vital to large language models (LLMs) such as GPT (Generative Pre-trained Transformer) models.
There are three main types of attention mechanisms:
Self-attention: This is an attention mechanism used within a single sequence. It is also called an intra-attention or internal attention mechanism. It captures dependencies between tokens, enabling models to understand contextual information and relationships within the sequence.
Encoder-decoder attention: This mechanism is used between two sequences: the encoder processes the input sequence, and the decoder generates the output sequence. It is also known as an inter-attention or external attention mechanism. Its strength lies in helping to align the source and target sequences while facilitating information transfer.
Multi-head attention: This is a more complex mechanism in comparison. It runs several attention operations in parallel, each with its own learned parameters projecting the input into a different subspace, and combines their results.
This Answer will focus on the self-attention mechanism, and specifically on its scaled dot product variant. Scaled dot product attention is a subset of dot product attention: it uses a dot product operator followed by a scaling step, which keeps the resulting scores in a numerically stable range and helps the model train reliably.
Let's start with an example. Our input sentence is "I like to go to the park and play —." There are several plausible ways to complete this sentence: cricket, football, or basketball, for instance. However, from the context of the sentence, we know that the completion cannot be book, house, or sleep. This is the process we want our model to learn: the model should understand the relationships between words within the sequence. These relationships, or context similarities, can be represented as matrices and visualized as heatmaps.
The following diagram shows the methodology that scaled dot product attention uses to determine which words suit our sequence given the context. In simpler terms, the purpose of this mechanism is to decide which words of a sentence the model should pay attention to.
The process starts with the input data, which in our example is the sentence above. This sequence is divided into tokens of individual words. The tokens are then processed through preprocessing layers, such as the embedding layer. After that, we obtain three vectors for each token:
Query vector (Q)
Key vector (K)
Value vector (V)
Note: The dimensions of vectors Q and K must be the same to allow the matrix operations that follow. However, vector V can have different dimensions. In our example, we will keep all three the same for simplicity.
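To make this concrete, here is a minimal NumPy sketch of how Q, K, and V might be produced. The embeddings and weight matrices below are random stand-ins for a trained embedding layer and learned projection weights, so the numbers are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 4 tokens, each as a d_model-dimensional embedding.
# In a real model, these come from a trained embedding layer.
d_model = d_k = 8                      # all dimensions equal for simplicity
embeddings = rng.normal(size=(4, d_model))

# Learned projection matrices (random here, purely for illustration).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = embeddings @ W_q   # query vectors, one row per token
K = embeddings @ W_k   # key vectors
V = embeddings @ W_v   # value vectors
```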
First, we start with step one. In this step, we calculate the dot product of vector Q with the transpose of vector K. The result, QK^T, is a matrix of raw similarity scores between every pair of tokens.
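Continuing the sketch above, step one is a single matrix multiplication:

```python
# Step 1: raw similarity scores between every query and every key.
# scores[i, j] measures how relevant token j is to token i.
scores = Q @ K.T       # shape: (4, 4)
```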
Second, we scale our data. We do this because when the dimensions are larger, the variance of the dot products grows with them, which can push the softmax into regions with vanishingly small gradients. Dividing the scores by the square root of the key dimension, √d_k, keeps their variance roughly constant.
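In code, the scaling is one line on top of the previous step:

```python
# Step 2: divide by sqrt(d_k) so the variance of the scores stays
# roughly constant as the key dimension grows.
scaled_scores = scores / np.sqrt(d_k)
```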
Third, we apply the softmax function to the scaled scores to create an attention matrix, which we will name A. Each row of A is a probability distribution describing how much one token attends to every other token.
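A row-wise softmax turns the scaled scores into the attention matrix:

```python
# Step 3: softmax over each row produces attention weights that sum to 1.
def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

A = softmax(scaled_scores)   # shape: (4, 4), each row sums to 1
```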
Finally, we will again use dot product multiplication. This time, we multiply our attention matrix A by the value vectors V. Each output row is therefore a weighted average of the value vectors, with the weights given by the attention scores.
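The last step is another matrix multiplication:

```python
# Step 4: each output row is a weighted average of the value vectors,
# with weights taken from the corresponding row of A.
output = A @ V         # shape: (4, d_k)
```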
With that, we have the final equation of our attention mechanism:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
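Putting the four steps together, a minimal self-contained sketch of this formula (not a production implementation) could look like this:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for 2-D arrays."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # steps 1 and 2
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # step 3: row softmax
    return weights @ V                            # step 4

# Usage with random stand-in vectors: 4 tokens, dimension 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```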
To sum it up, scaled dot product attention is a powerful tool for modeling dependencies and capturing relevant information in sequences. It measures the similarity and importance of elements, which allows models to selectively focus on crucial inputs, leading to informed predictions and decisions. This enhances the performance of machine learning and deep learning models by dynamically attending to relevant information and making context-aware decisions.