...


Local vs. Global Attention

Explore the distinctions between global and local attention mechanisms, uncovering the efficiency and dynamic nature of local attention.

We've previously explored global attention mechanisms, which establish connections across all inputs, whether spatial, channel-related, or temporal. Now, let's turn to another critical aspect: local attention.

Local attention mechanism

As we know, convolution is a local operation because of its inductive bias, or modeling assumption, whereas attention is a global operation with few modeling assumptions, i.e., a low inductive bias. Spatial attention, as depicted below, links each blue pixel in space to a red pixel, capturing their relationship through an attention map. This is known as non-local attention, although other options are available.

Non-local attention block

The matrix depicted in the above illustration represents the attention distribution within a spatial context. Each element in the matrix corresponds to a position in the input space, and the intensity of the connections between elements is visually represented by the color scale.

The gray matrix in the lower middle signifies a non-local attention pattern. Unlike local operations such as convolution, where interactions are confined to a specific neighborhood, non-local attention allows each position in the input space to contribute to the attention mechanism without restrictions.
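
To make this concrete, the sketch below implements a minimal non-local attention block in PyTorch. The 1x1 convolutions, the channel reduction, and the residual connection are illustrative assumptions rather than a fixed recipe; the essential point is that the attention map has shape (H·W) x (H·W), so every spatial position can attend to every other position.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Minimal non-local (global spatial) attention block: every position
    attends to every other position, giving an (H*W) x (H*W) attention map."""

    def __init__(self, channels, reduced=None):
        super().__init__()
        reduced = reduced or channels // 2
        self.theta = nn.Conv2d(channels, reduced, kernel_size=1)  # queries
        self.phi = nn.Conv2d(channels, reduced, kernel_size=1)    # keys
        self.g = nn.Conv2d(channels, reduced, kernel_size=1)      # values
        self.out = nn.Conv2d(reduced, channels, kernel_size=1)    # project back

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                      # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)        # (B, HW, C')

        attn = torch.softmax(q @ k / k.shape[1] ** 0.5, dim=-1)  # (B, HW, HW)
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)      # back to a feature map
        return x + self.out(y)                                    # residual connection

# A 16-channel, 8x8 feature map: the attention map is 64 x 64.
x = torch.randn(2, 16, 8, 8)
print(NonLocalBlock(16)(x).shape)  # torch.Size([2, 16, 8, 8])
```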

Non-local attention

In the world of self-attention mechanisms, two fundamental design approaches emerge: global self-attention and local self-attention.

Global self-attention, as the name implies, operates without constraints imposed by input feature size. It encompasses the entire feature map, allowing each position to attend to every other position within the map.

On the other hand, local self-attention, analogous to convolution, focuses on modeling relations within a specified neighborhood. This localized attention is restricted to a predefined window or patch around a given pixel, akin to how convolution operates with a kernel. The explicit consideration of a defined window serves to mitigate computational overhead.
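
A quick back-of-the-envelope comparison shows why restricting attention to a window matters; the feature-map and window sizes below are assumed purely for illustration.

```python
# Pairwise interactions per attention layer (illustrative sizes).
H = W = 64          # assumed spatial size of the feature map
k = 7               # assumed side length of the local attention window

n_positions = H * W
global_pairs = n_positions * n_positions   # every position attends to all positions
local_pairs = n_positions * (k * k)        # every position attends to a k x k window

print(f"{global_pairs:,} vs. {local_pairs:,}")  # 16,777,216 vs. 200,704 (~84x fewer)
```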

Local attention is best understood in relation to convolution: both operate within a specified spatial context. While convolution employs a kernel to process local patches, local attention achieves a similar effect by attending to neighboring positions within a designated window, balancing computational efficiency with the modeling of spatial relationships.
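
As a sketch of how such a window can be realized in code, one common approach (assumed here for illustration, with a single attention head and non-overlapping windows) is to partition the feature map into windows and compute attention independently inside each one.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Minimal local self-attention: the feature map is split into
    non-overlapping win x win windows, and attention is computed only
    inside each window, so each attention map is (win*win) x (win*win)."""

    def __init__(self, channels, win=4):
        super().__init__()
        self.win = win
        self.qkv = nn.Linear(channels, channels * 3)
        self.proj = nn.Linear(channels, channels)

    def forward(self, x):                            # x: (B, C, H, W), H and W divisible by win
        b, c, h, w = x.shape
        win = self.win
        # Partition into (B * num_windows, win*win, C) token groups.
        x = x.reshape(b, c, h // win, win, w // win, win)
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, win * win, c)

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / c ** 0.5, dim=-1)
        x = self.proj(attn @ v)                      # (B * num_windows, win*win, C)

        # Reverse the window partition back to (B, C, H, W).
        x = x.reshape(b, h // win, w // win, win, win, c)
        return x.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

# A 16-channel, 8x8 feature map attended in 4x4 windows.
x = torch.randn(2, 16, 8, 8)
print(WindowAttention(16, win=4)(x).shape)  # torch.Size([2, 16, 8, 8])
```

Choosing a larger window trades computational cost for a wider receptive field, much like choosing a larger convolution kernel.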

Comparing global and local attention

Consider global self-attention like a fully connected layer, where every ...