Humans do not tend to pay attention sequentially. We don't typically start looking at a picture from a corner; our focus goes to the important features first, such as the objects in the middle. Humans pay attention in two ways: conscious attention and unconscious attention.
We would like our neural networks to model conscious attention so they can focus on the important parts of the input first. This is done through what is commonly known as the attention mechanism. It is like blurring out the less important features.
In the simplest terms, the attention mechanism works by calculating a weighted sum over the input features, where the weights tell the model how important each feature is. The process of calculating these weights is crucial and varies across applications. Here, we explain the attention mechanism in a general form that is the same for most applications.
The attention mechanism works in two steps:
Calculating the attention distribution from the input.
Calculating the context vector using the attention distribution.
The attention distribution is what assigns weightage to the different inputs. Before calculating it, we first encode our inputs using a neural network. This encoded representation of the input is called the keys ($k$).
Another thing used to calculate the attention distribution is a query ($q$), which represents what the model is currently looking for. A score function $f$ combines the query with each key to produce an energy score, $e_i = f(q, k_i)$.
The selection of a good score function is really important. Two of the most common score functions are additive attention and multiplicative attention. Some score functions incorporate learnable parameters. The table below summarizes some of the common score functions.
| Score Function | Equation |
|---|---|
| Additive | $f(q, k) = v^T \text{act}(W_1 k + W_2 q + b)$ |
| Multiplicative | $f(q, k) = q^T k$ |
| General | $f(q, k) = q^T W k$ |
| Location-based | $f(q, k) = f(q)$ |
Here, $v$, $W_1$, $W_2$, $W$, and $b$ are learnable parameters, and $\text{act}$ is an activation function, commonly $\tanh$.
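To make the table concrete, here is a minimal NumPy sketch of the multiplicative and additive score functions. The dimensions, the $\tanh$ activation, and the function names are illustrative assumptions, not fixed by the text.

```python
import numpy as np

def multiplicative_score(q, k):
    """Multiplicative (dot-product) score: f(q, k) = q^T k."""
    return q @ k

def additive_score(q, k, W1, W2, v, b):
    """Additive score: f(q, k) = v^T act(W1 k + W2 q + b).

    W1, W2, v, and b are learnable parameters; tanh is a common
    choice for the activation `act`.
    """
    return v @ np.tanh(W1 @ k + W2 @ q + b)

# Illustrative shapes (assumed): query/key dimension d, hidden dimension h.
d, h = 4, 8
rng = np.random.default_rng(0)
q, k = rng.normal(size=d), rng.normal(size=d)
W1, W2 = rng.normal(size=(h, d)), rng.normal(size=(h, d))
v, b = rng.normal(size=h), rng.normal(size=h)

print(multiplicative_score(q, k))          # a scalar energy score
print(additive_score(q, k, W1, W2, v, b))  # also a scalar energy score
```

Either function maps a (query, key) pair to a single scalar energy; the additive variant simply passes the pair through a small learned layer first.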
These energy scores are used to compute the attention distribution $\alpha$ by applying the softmax function: $\alpha_i = \text{softmax}(e_i) = \frac{\exp(e_i)}{\sum_j \exp(e_j)}$. The weights sum to one, so each $\alpha_i$ reflects the relative importance of the $i$-th input.
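As a quick sketch, the softmax step can be computed as follows, assuming the energy scores are collected in a NumPy array:

```python
import numpy as np

def softmax(e):
    """Turn energy scores e_i into attention weights that sum to one."""
    e = e - e.max()          # subtract the max for numerical stability
    exp_e = np.exp(e)
    return exp_e / exp_e.sum()

energies = np.array([2.0, 0.5, -1.0])
alphas = softmax(energies)
print(alphas, alphas.sum())  # approximately [0.79 0.18 0.04], summing to 1.0
```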
Now, we calculate the context vector $c$ as the weighted sum of the encoded inputs: $c = \sum_i \alpha_i k_i$. Here, $\alpha_i$ is the attention weight and $k_i$ is the encoded representation of the $i$-th input. The context vector summarizes the input with the important parts emphasized.
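Putting the two steps together, here is a minimal end-to-end sketch: encode the inputs into keys, score each key against the query, normalize the scores with softmax, and take the weighted sum as the context vector. The random "encoder" matrix and all shapes are stand-in assumptions for a real trained network.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy setup: n inputs of dimension d_in, encoded keys of dimension d.
n, d_in, d = 5, 6, 4
inputs = rng.normal(size=(n, d_in))
W_enc = rng.normal(size=(d_in, d))   # stand-in for a learned encoder
keys = inputs @ W_enc                # encode inputs into keys k_i
query = rng.normal(size=d)           # the query q

energies = keys @ query                      # e_i = f(q, k_i), dot-product score
alphas = np.exp(energies - energies.max())
alphas /= alphas.sum()                       # attention distribution (softmax)

context = alphas @ keys                      # c = sum_i alpha_i k_i

print(alphas)         # weights over the 5 inputs, summing to one
print(context.shape)  # (4,): one context vector of key dimension
```

Swapping the dot-product line for one of the other score functions from the table changes how the energies are computed, but the softmax and weighted-sum steps stay the same.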
The attention mechanism is one of the most important concepts in deep learning and has many applications, some of which are given below:
Image caption generation
Image-based analysis
Action recognition
Text classification
Machine translation
Speech recognition
Recommendation systems
Attention is used extensively in Natural Language Processing (NLP). However, it is not limited to NLP; it is also used in computer vision. Some famous models that use the attention mechanism are as follows:
GPT-3: Generative Pre-trained Transformer 3 (GPT-3) is a transformer-based model trained not only to generate text but also to summarize large documents. It is also capable of handling conversational tasks. Another surprising feature is that it can write code from comments.
Show, Attend and Tell: This model was trained to generate captions for images.
ViT: The Vision Transformer (ViT), introduced by Google's research team, outperformed conventional ResNets on computer vision tasks.