Image Classification with Vision Transformer (ViT and DeiT)
Discover image classification with ViT and DeiT, exploring their architectures and the role of self-attention.
Let's explore the use of transformers in the context of image classification. We'll discuss two significant architectures: the vision transformer (ViT) and the data-efficient image transformer (DeiT).
Traditional image classification architecture
To understand how transformers revolutionize image classification, let's first revisit the conventional architecture: the convolutional neural network (CNN). Typically, an image classification system consists of a backbone known as the encoder or feature extractor. This encoder comprises multiple convolution layers, each followed by max pooling, where the max pool operation typically halves the spatial dimensions of the feature map.
The final feature map is then flattened to create the ultimate feature vector. Subsequently, this vector serves as input for the decision layer or decoder, which, in the case of image classification, typically involves a softmax layer. The softmax layer produces class probabilities, determining the output class.
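The encoder-decoder flow described above can be sketched with simple shape arithmetic. This is a minimal illustration, not a real network: the number of stages (4), the channel count (512), and the toy logits are assumptions chosen for concreteness.

```python
import math

def encoder_output_side(side, num_stages):
    # Each conv + max-pool stage halves the spatial dimensions.
    for _ in range(num_stages):
        side //= 2
    return side

def softmax(logits):
    # Decision layer (decoder): convert raw scores into class probabilities.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A 224x224 input image through 4 conv + max-pool stages (assumed depth):
side = encoder_output_side(224, 4)            # 224 -> 112 -> 56 -> 28 -> 14
channels = 512                                # assumed channels of the last conv
feature_vector_len = side * side * channels   # flattened feature vector

probs = softmax([2.0, 1.0, 0.1])              # toy logits for 3 classes
print(side, feature_vector_len)               # spatial side and vector length
print(sum(probs))                             # probabilities sum to 1
```

The key point is the division of labor: the encoder compresses the image into a single feature vector, and the softmax decoder turns that vector's scores into a probability distribution over classes.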
Challenges in replacing the encoder with an attention mechanism
Considering the encoder-decoder architecture, we might contemplate replacing the encoder's convolutional block with a self-attention mechanism.
However, this straightforward approach encounters a significant drawback: the attention map is substantial, growing on the order of the square of the image's height times width.
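A quick calculation makes the scaling problem concrete. Attending over every pixel pair produces an (H·W) × (H·W) map; the patch-token comparison below assumes 16×16 patches, the size used by the standard ViT configuration.

```python
def attention_map_entries(num_tokens):
    # Full self-attention compares every token with every other token,
    # so the attention map has num_tokens * num_tokens entries.
    return num_tokens * num_tokens

# Treating every pixel of a 224x224 image as a token:
pixel_tokens = 224 * 224                       # 50,176 tokens
print(attention_map_entries(pixel_tokens))     # ~2.5 billion entries per head

# Grouping pixels into 16x16 patches (as ViT does) shrinks this drastically:
patch_tokens = (224 // 16) * (224 // 16)       # 14 * 14 = 196 tokens
print(attention_map_entries(patch_tokens))     # 38,416 entries per head
```

This quadratic blowup in both memory and computation is exactly why pixel-level attention is impractical, and it motivates the patch-based tokenization that ViT introduces.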
Another challenge in replacing the encoder with an attention mechanism in image classification involves capturing local spatial information effectively. Convolutional layers in traditional architectures like CNNs are inherently designed to capture local features and spatial hierarchies. However, when using transformers ...