What is attention?

An overview of attention

Humans do not tend to pay attention sequentially. We don't typically scan a picture starting from a corner––our focus goes to the important features first. For example, when looking at a picture, we often look at the objects in the middle first. Humans pay attention in two ways: conscious attention and unconscious attention.

We would like our neural networks to model conscious attention so that they focus on the important parts of the input first. This is achieved by attention, more commonly known as the attention mechanism. Its effect is like blurring out the less important features.

An example of attention mechanism

How it works

In the simplest terms, the attention mechanism computes a weighted sum of the input features. The weights tell the model how important each feature is. How these weights are calculated is crucial and varies across applications. Here, we explain the attention mechanism in a general form that applies to most applications.

The working of the attention mechanism is divided into two parts:

  • Calculating the attention distribution from the input.

  • Calculating the context vector from the attention distribution.
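These two steps can be sketched end to end in NumPy. The shapes, the dot-product score, and the softmax used here are illustrative choices, not the only options:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy shapes: 4 input positions, 8-dimensional encodings.
keys = rng.standard_normal((4, 8))    # K: encoded inputs
values = keys                         # V: here taken equal to K
query = rng.standard_normal(8)        # q: task-related representation

# Step 1: attention distribution from the input.
# (Dot-product score per key, then softmax as the distribution function.)
energies = keys @ query
weights = np.exp(energies - energies.max())
weights /= weights.sum()              # one weight per input position

# Step 2: context vector as the weighted sum of the values.
context = weights @ values

print(weights.sum())   # the weights form a probability distribution
print(context.shape)   # a single 8-dimensional context vector
```

The sections below unpack each of these steps in turn.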

The attention distribution is what assigns weight to the different inputs. Before calculating it, we first encode our inputs using a neural network. This encoded representation of the input is called the keys ($K$).

The other ingredient used to calculate the attention distribution is a query ($q$). It is a task-related representation that can be a vector or a matrix, depending on the task. A neural network computes the correlation between the keys and the query using a score function $f$. The score function expresses how the keys and the query are related by producing an energy score.

The selection of a good score function is important. The two most common score functions are additive attention and multiplicative attention. Some score functions incorporate learnable parameters. The table below summarizes some of the common score functions.

| Score function | Equation |
| --- | --- |
| Additive | $f(q, k) = v^T \mathrm{act}(W_1 k + W_2 q + b)$ |
| Multiplicative | $f(q, k) = q^T k$ |
| General | $f(q, k) = q^T W k$ |
| Location-based | $f(q, k) = f(q)$ |

Here, $v, W, W_1, W_2, b$ are the learnable parameters and $\mathrm{act}$ is the activation function.
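The score functions in the table can be written directly in NumPy. In this sketch the learnable parameters are random stand-ins, and we assume `tanh` as the activation function:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
q = rng.standard_normal(d)           # query
k = rng.standard_normal(d)           # a single key

# Stand-ins for the learnable parameters v, W, W1, W2, b.
W1 = rng.standard_normal((d, d))
W2 = rng.standard_normal((d, d))
W = rng.standard_normal((d, d))
v = rng.standard_normal(d)
b = rng.standard_normal(d)

def additive(q, k):
    # f(q, k) = v^T act(W1 k + W2 q + b), with act = tanh here.
    return v @ np.tanh(W1 @ k + W2 @ q + b)

def multiplicative(q, k):
    # f(q, k) = q^T k — no learnable parameters.
    return q @ k

def general(q, k):
    # f(q, k) = q^T W k — multiplicative with a learnable W.
    return q @ W @ k

# Each score function maps a (query, key) pair to a scalar energy score.
print(additive(q, k), multiplicative(q, k), general(q, k))
```

In practice, one score is computed per key, giving a vector of energy scores over the input positions.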

These energy scores are used to compute the attention weights $\alpha$ by applying a distribution function $g$. The choice of $g$ varies according to the application in which we are using the attention mechanism.
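A common choice for the distribution function $g$ is the softmax, which turns arbitrary energy scores into weights that are positive and sum to one:

```python
import numpy as np

def softmax(energies):
    # Subtracting the max before exponentiating is a standard
    # trick for numerical stability; it does not change the result.
    shifted = energies - np.max(energies)
    exp = np.exp(shifted)
    return exp / exp.sum()

scores = np.array([2.0, 1.0, 0.1])   # energy scores from f
alpha = softmax(scores)              # attention weights

print(alpha)          # larger scores receive larger weights
print(alpha.sum())    # the weights sum to 1
```

Other distribution functions (e.g., sparse variants) can be substituted for softmax when the application calls for it.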

Now, we calculate the context vector $c$. To do so, we introduce one more representation, $V$, called the values. The values have a one-to-one correspondence with the keys $K$; in some cases, $V$ is the same as $K$. The context vector $c$ is calculated as:

$$c = \phi(\{\alpha_i\}, \{v_i\})$$

Here, $\phi$ is a function that returns a single vector, which we refer to as the context vector. In most cases, $\phi$ performs a weighted sum, $c = \sum_i \alpha_i v_i$. Our neural network then uses this context vector to make predictions.
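With $\phi$ taken as the weighted sum, the context vector is a one-liner. The weights and values below are made-up numbers chosen so the result is easy to verify by hand:

```python
import numpy as np

def context_vector(alpha, values):
    # phi as a weighted sum: c = sum_i alpha_i * v_i
    return alpha @ values

alpha = np.array([0.7, 0.2, 0.1])    # attention weights (sum to 1)
values = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]])      # V: one value vector per key

c = context_vector(alpha, values)
print(c)   # [0.8, 0.3] = 0.7*[1,0] + 0.2*[0,1] + 0.1*[1,1]
```

Note that the result has the dimensionality of a single value vector, regardless of how many inputs were attended over.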

The working of the attention mechanism

Applications

The attention mechanism is one of the most important concepts in deep learning and has many applications. Some of them are given below:

  • Image caption generation

  • Image-based analysis

  • Action recognition

  • Text classification

  • Machine translation

  • Speech recognition

  • Recommendation systems

Models that use attention mechanism

Attention is used extensively in Natural Language Processing (NLP). However, it is not limited to NLP; it is also used in computer vision. Some famous models that use the attention mechanism are as follows:

  • GPT-3: Generative Pre-trained Transformer 3 (GPT-3) is a transformer-based model trained not only to generate text but also to summarize large documents. It is also capable of handling conversational tasks and, perhaps most surprisingly, can write code from comments.

  • The Show, Attend and Tell model: This model was trained to generate captions for images.

  • ViT: The Vision Transformer (ViT) was introduced by Google's research team. This model outperformed conventional ResNets on computer vision tasks.

Copyright ©2024 Educative, Inc. All rights reserved