Let's explore how we can employ self-attention in computer vision.

Comparing self-attention and convolution in computer vision

The process of generating self-attended feature maps involves a series of transformations applied to a 3D image representation, denoted as X. Here, X is a 2D image augmented with a third dimension that holds channel information, such as the different color channels, and it is transformed by weight matrices W.
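To make this concrete, here is a minimal NumPy sketch of such a representation; the shapes (an 8 x 8 image with 3 color channels) and the variable names are illustrative assumptions, not values taken from the lesson.

```python
import numpy as np

# A toy 3D image representation X: height x width x channels.
# The 8 x 8 x 3 shape is an assumption chosen for illustration.
H, W, C = 8, 8, 3
X = np.random.rand(H, W, C)

# Self-attention treats the image as a sequence: the spatial grid
# is flattened into H * W "tokens", one C-dimensional vector per pixel.
tokens = X.reshape(H * W, C)  # shape: (64, 3)
```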

The first step extracts three sets of weight matrices: W_k, W_q, and W_v. These weight matrices are then applied to the original image representation X to form three distinct matrices: K (keys), Q (queries), and V (values). The attention mechanism computes an attention map from the relationships between K and Q, highlighting how relevant each element of the feature map is to every other element. Finally, the attention map is multiplied with the values V, so that the output emphasizes the features the attention mechanism has judged important. The result of this weighted combination of V is the self-attended feature map. This process systematically captures significant patterns and relationships within the original image feature maps, yielding a refined, context-aware representation.
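In equation form, this is the standard scaled dot-product attention, Attention(Q, K, V) = softmax(QK^T / sqrt(d)) V, applied to the flattened pixel sequence. The sketch below walks through the full pipeline in NumPy under stated assumptions: the dimensions, the random stand-in weight matrices, and the helper softmax function are illustrative, not part of the lesson's material.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

H, W, C, d = 8, 8, 3, 16          # spatial size, channels, projection dim (assumed)
X = rng.random((H * W, C))        # flattened image: one row per pixel

# Learned projection matrices (random stand-ins for trained weights).
W_q = rng.random((C, d))
W_k = rng.random((C, d))
W_v = rng.random((C, d))

Q = X @ W_q                       # queries: (64, d)
K = X @ W_k                       # keys:    (64, d)
V = X @ W_v                       # values:  (64, d)

# Attention map: pairwise relevance between every pair of pixels.
attn = softmax(Q @ K.T / np.sqrt(d))   # (64, 64), each row sums to 1

# Weighted combination of V gives the self-attended feature map.
out = (attn @ V).reshape(H, W, d)      # back to a spatial layout
```

Note that every output pixel in `out` mixes information from all 64 input positions at once, which is what distinguishes self-attention from a convolution's fixed local receptive field.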
