
Spatial vs. Channel vs. Temporal Attention

Understand the differences between spatial, channel, and temporal attention mechanisms used in vision transformers. Learn how each attention type processes image and video data to capture relationships across pixels, channels, and time frames. Gain practical insights, including a basic implementation of spatial self-attention that enhances feature maps for computer vision tasks.

Let's discuss the differences between channel, spatial, and temporal attention mechanisms.

Spatial attention

When working with input feature maps of size $N \cdot H \cdot W$, spatial attention focuses on aggregating pixels $(H \cdot W)$ using self-attention. The resulting attention map has a size of $HW \cdot HW$, capturing relations across all pixels.
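
To make the shapes concrete, here is a minimal sketch (assuming PyTorch and arbitrary example sizes) that flattens the $H \cdot W$ spatial grid into tokens and computes the $HW \cdot HW$ attention map:

```python
import torch

# Illustrative sizes: N channels over an H x W spatial grid.
N, H, W = 64, 16, 16
x = torch.randn(N, H, W)        # input feature map

# Flatten the spatial dimensions so each of the H*W pixels
# becomes one token with an N-dimensional feature vector.
tokens = x.reshape(N, H * W).T  # (H*W, N)

# Scaled dot-product similarity between every pair of pixels,
# normalized with softmax, yields the spatial attention map.
scores = tokens @ tokens.T / N ** 0.5  # (H*W, H*W)
attn = torch.softmax(scores, dim=-1)

print(attn.shape)  # torch.Size([256, 256]), i.e., HW x HW
```

The query and key projections are omitted here for brevity; the point is that every pixel attends to every other pixel, which is why the attention map grows quadratically with $H \cdot W$.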

Self-attention feature map generation

Now, consider self-attention feature map generation. Starting with a 3D image tensor $X$, whose dimensions $N$, $H$, and $W$ represent the number of channels (color channels for the input image), height, and width of either the input image or an intermediate feature map, we can attend to all the two-dimensional spatial relations $(H \cdot W)$ ...
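
As a hedged sketch of how such a layer might be implemented (the 1×1-convolution projections, the reduced query/key width, and the learnable residual scale `gamma` are illustrative choices in the style of SAGAN-type self-attention, not prescribed by the text above):

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over the H*W spatial positions of a feature map.

    A batch dimension is added to keep the module idiomatic PyTorch;
    the text above describes a single (N, H, W) tensor.
    """

    def __init__(self, n_channels: int):
        super().__init__()
        # 1x1 convolutions act as per-pixel linear projections.
        self.query = nn.Conv2d(n_channels, n_channels // 8, kernel_size=1)
        self.key = nn.Conv2d(n_channels, n_channels // 8, kernel_size=1)
        self.value = nn.Conv2d(n_channels, n_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # residual scale, starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, h, w = x.shape
        q = self.query(x).reshape(b, -1, h * w).transpose(1, 2)  # (B, HW, N/8)
        k = self.key(x).reshape(b, -1, h * w)                    # (B, N/8, HW)
        v = self.value(x).reshape(b, -1, h * w)                  # (B, N, HW)

        attn = torch.softmax(q @ k, dim=-1)  # (B, HW, HW): pixel-to-pixel weights
        out = v @ attn.transpose(1, 2)       # (B, N, HW): attended features
        out = out.reshape(b, n, h, w)
        return x + self.gamma * out          # residual: starts as the identity


x = torch.randn(1, 64, 16, 16)            # (batch, N, H, W)
print(SpatialSelfAttention(64)(x).shape)  # torch.Size([1, 64, 16, 16])
```

Because `gamma` is initialized to zero, the layer initially passes features through unchanged and learns during training how much spatially attended context to mix back into the feature map.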