
Spatial vs. Channel vs. Temporal Attention

Understand the differences between spatial, channel, and temporal attention mechanisms used in vision transformers. Learn how each attention type processes image and video data to capture relationships across pixels, channels, and time frames. Gain practical insights, including a basic implementation of spatial self-attention that enhances feature maps for computer vision tasks.

Let's discuss the differences between channel, spatial, and temporal attention mechanisms.

Spatial attention

When working with input feature maps of size $N \cdot H \cdot W$, spatial attention focuses on aggregating pixels $(H \cdot W)$ using self-attention. The resulting attention map has a size of $HW \cdot HW$, capturing relations across all pixels.
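
To make the shapes concrete, here is a minimal sketch (assuming PyTorch and arbitrary example sizes) that flattens the $H \cdot W$ spatial grid into tokens and computes the $HW \cdot HW$ attention map:

```python
import torch

# Illustrative sizes: N channels over an H x W spatial grid.
N, H, W = 64, 16, 16
x = torch.randn(N, H, W)        # input feature map

# Flatten the spatial dimensions so each of the H*W pixels
# becomes one token with an N-dimensional feature vector.
tokens = x.reshape(N, H * W).T  # (H*W, N)

# Scaled dot-product similarity between every pair of pixels,
# normalized with softmax, yields the spatial attention map.
scores = tokens @ tokens.T / N ** 0.5  # (H*W, H*W)
attn = torch.softmax(scores, dim=-1)

print(attn.shape)  # torch.Size([256, 256]), i.e., HW x HW
```

The query and key projections are omitted here for brevity; the point is that every pixel attends to every other pixel, which is why the attention map grows quadratically with $H \cdot W$.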

Self-attention feature map generation

Now, consider self-attention feature map generation. Starting with a 3D image tensor $X$, whose dimensions $N$, $H$, and $W$ represent the number of channels (color channels for the input image), height, and width of either the input image or an intermediate feature map, we can attend to all the two-dimensional spatial relations $(H \cdot W)$ ...
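
As a hedged sketch of how such a layer might be implemented (the 1×1-convolution projections, the reduced query/key width, and the learnable residual scale `gamma` are illustrative choices in the style of SAGAN-type self-attention, not prescribed by the text above):

```python
import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    """Self-attention over the H*W spatial positions of a feature map.

    A batch dimension is added to keep the module idiomatic PyTorch;
    the text above describes a single (N, H, W) tensor.
    """

    def __init__(self, n_channels: int):
        super().__init__()
        # 1x1 convolutions act as per-pixel linear projections.
        self.query = nn.Conv2d(n_channels, n_channels // 8, kernel_size=1)
        self.key = nn.Conv2d(n_channels, n_channels // 8, kernel_size=1)
        self.value = nn.Conv2d(n_channels, n_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # residual scale, starts at 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, h, w = x.shape
        q = self.query(x).reshape(b, -1, h * w).transpose(1, 2)  # (B, HW, N/8)
        k = self.key(x).reshape(b, -1, h * w)                    # (B, N/8, HW)
        v = self.value(x).reshape(b, -1, h * w)                  # (B, N, HW)

        attn = torch.softmax(q @ k, dim=-1)  # (B, HW, HW): pixel-to-pixel weights
        out = v @ attn.transpose(1, 2)       # (B, N, HW): attended features
        out = out.reshape(b, n, h, w)
        return x + self.gamma * out          # residual: starts as the identity


x = torch.randn(1, 64, 16, 16)            # (batch, N, H, W)
print(SpatialSelfAttention(64)(x).shape)  # torch.Size([1, 64, 16, 16])
```

Because `gamma` is initialized to zero, the layer initially passes features through unchanged and learns during training how much spatially attended context to mix back into the feature map.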