Let's explore how we can employ self-attention in computer vision.
Comparing self-attention and convolution in computer vision
The process of generating self-attended feature maps involves a series of transformations applied to an image representation, denoted as X. Initially, X is a 2D image with an added third dimension holding channel information, such as the different color channels; it is this 3D representation that the weight matrices W, described next, transform.
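To make the representation concrete, here is a minimal sketch of such an input tensor. The sizes (an 8x8 image with three color channels) are illustrative assumptions, and flattening the spatial grid into one token per pixel is a common convention assumed here rather than spelled out in the text:

```python
import torch

# Assumed, illustrative sizes: a 2D image grid plus a channel dimension.
channels, height, width = 3, 8, 8          # e.g. RGB color channels
X = torch.randn(channels, height, width)   # 3D image representation

# For attention, each pixel is treated as one token of size `channels`.
tokens = X.flatten(1).T                    # (height*width, channels)
print(tokens.shape)                        # torch.Size([64, 3])
```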
The first step is to learn three weight matrices: Wk, Wq, and Wv. These weight matrices are applied to the original image representation X to form three distinct matrices: the key (K), query (Q), and value (V) matrices. Next, the attention mechanism computes an attention map from the dot products between Q and K (normalized with a softmax), which scores how relevant each element of the feature map is to every other element. The attention map is then multiplied with the values V, so that the output emphasizes the features the attention mechanism deems important. The final output is the self-attended feature map: the weighted combination of V under the attention map. This process systematically captures significant patterns and relationships within the original image feature maps, yielding a refined, context-aware representation.
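The following PyTorch sketch walks through these steps end to end. The input sizes and the embedding dimension are illustrative assumptions, and the scaled softmax normalization is the standard self-attention formulation rather than a detail from the text:

```python
import torch

torch.manual_seed(0)
channels, height, width, emb_dim = 3, 8, 8, 16   # assumed, illustrative sizes

X = torch.randn(channels, height, width)         # 3D image representation
tokens = X.flatten(1).T                          # (height*width, channels)

# Step 1: three learned weight matrices Wq, Wk, Wv.
Wq = torch.nn.Linear(channels, emb_dim, bias=False)
Wk = torch.nn.Linear(channels, emb_dim, bias=False)
Wv = torch.nn.Linear(channels, emb_dim, bias=False)

# Step 2: apply them to X to obtain the Q, K, and V matrices.
Q, K, V = Wq(tokens), Wk(tokens), Wv(tokens)     # each (height*width, emb_dim)

# Step 3: attention map from the relationship between Q and K
# (scaled dot products, normalized row-wise with softmax).
attn = (Q @ K.T / emb_dim ** 0.5).softmax(dim=-1)  # (H*W, H*W)

# Step 4: weighted combination of V using the attention map.
out = attn @ V                                   # (height*width, emb_dim)

# Reshape back into a spatial grid: the self-attended feature map.
self_attended = out.T.reshape(emb_dim, height, width)
print(self_attended.shape)                       # torch.Size([16, 8, 8])
```

Each row of the attention map sums to one, so every output location is a convex combination of all pixel values, weighted by learned relevance, rather than of a fixed local neighborhood as in convolution.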