Spatial vs. Channel vs. Temporal Attention

Discover attention mechanisms in computer vision, including spatial, channel, and temporal attention, and explore how they apply to feature maps and video frames.

Let's discuss the differences between channel, spatial, and temporal attention mechanisms.

Spatial attention

When working with input feature maps of size N × H × W, spatial attention aggregates over the pixels (H × W) using self-attention. The resulting attention map has a size of HW × HW, capturing relations across all pairs of pixels.
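To make the shapes concrete, here is a minimal NumPy sketch (the dimensions and the raw dot-product scoring are illustrative assumptions, not a specific model): each pixel becomes an N-dimensional vector, and scaled dot-product self-attention over the HW pixels yields an HW × HW attention map.

```python
import numpy as np

# Hypothetical dimensions: N channels over an H x W spatial grid.
N, H, W = 8, 4, 4
rng = np.random.default_rng(0)

X = rng.standard_normal((N, H, W))      # input feature map
pixels = X.reshape(N, H * W).T          # (HW, N): one N-dim vector per pixel

# Scaled dot-product scores between every pair of pixels
scores = pixels @ pixels.T / np.sqrt(N)             # (HW, HW)
scores -= scores.max(axis=1, keepdims=True)         # numerical stability
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

print(attn.shape)  # (16, 16): the HW x HW attention map
```

Each row of `attn` is a probability distribution telling one pixel how much to attend to every other pixel.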

Self-attention feature map generation

Now, consider self-attention feature map generation. Starting with a 3D tensor X, whose dimensions N, H, and W represent the number of channels, the height, and the width of either the input image or an intermediate feature map, we can attend to all the two-dimensional spatial relations (H × W).
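The full generation step can be sketched as follows. This is a minimal NumPy illustration, assuming per-channel linear projections for queries, keys, and values (here randomly initialised stand-ins for learned weights): attend over the HW pixels, then reshape the result back into an N × H × W feature map.

```python
import numpy as np

N, H, W = 8, 4, 4
rng = np.random.default_rng(42)

X = rng.standard_normal((N, H, W))
tokens = X.reshape(N, H * W).T          # (HW, N): pixels as tokens

# Randomly initialised projections stand in for learned weight matrices
Wq, Wk, Wv = (rng.standard_normal((N, N)) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

scores = Q @ K.T / np.sqrt(N)                       # (HW, HW)
scores -= scores.max(axis=1, keepdims=True)         # numerical stability
A = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

out = (A @ V).T.reshape(N, H, W)        # back to the feature-map shape
print(out.shape)  # (8, 4, 4)
```

The output has the same shape as the input, so this block can slot into a larger network as a drop-in spatial-attention layer.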
