
Image Classification with Vision Transformer (ViT and DeiT)

Discover image classification with vision transformers (ViT and DeiT), exploring their architectures and the role of self-attention.

Let's explore the use of transformers in the context of image classification. We'll discuss two significant architectures: the vision transformer (ViT), and the data-efficient image transformer (DeiT).

Traditional image classification architecture

To understand how transformers revolutionize image classification, let's first revisit the conventional architecture, the convolutional neural network (CNN) method. Typically, an image classification system consists of a backbone known as the encoder or feature extractor. This encoder comprises multiple convolution layers followed by max pooling, where the max pool operation often reduces the feature map by half.

Figure: Representations learnt by a deep network for digit classification

The final feature map is then flattened to create the ultimate feature vector. Subsequently, this vector serves as input for the decision layer or decoder, which, in the case of image classification, typically involves a softmax layer. The softmax layer produces class probabilities, determining the output class.
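The pipeline above is easy to trace with shape arithmetic alone. The sketch below is a minimal illustration, assuming a 28×28 grayscale digit, three conv-plus-pool stages with illustrative channel counts (32, 64, 128), "same" padding so convolutions preserve spatial size, and 2×2 max pooling that halves it; none of these specific numbers come from the lesson.

```python
import math

def maxpool_halve(size):
    # A 2x2 max pool with stride 2 halves each spatial dimension
    return size // 2

def softmax(logits):
    # Decision layer: turn raw class scores into probabilities
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical backbone: 28x28 input, three conv+pool stages
size, channels = 28, 1
for out_channels in (32, 64, 128):  # conv keeps size ("same" padding), pool halves it
    channels = out_channels
    size = maxpool_halve(size)

# Flatten the final feature map into the feature vector fed to the decoder
flattened = size * size * channels
print(size, flattened)  # 3 1152

# The softmax output sums to 1, giving a probability per class
probs = softmax([2.0, 1.0, 0.1])
print(round(sum(probs), 6))  # 1.0
```

Tracing shapes this way makes it concrete how aggressively pooling shrinks the spatial map before the flatten step, while the channel count grows to carry increasingly abstract features.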

Challenges in replacing the encoder with attention mechanism

Considering the encoder-decoder architecture, we might contemplate replacing the encoder's convolutional blocks with a self-attention mechanism.

Figure: Universal encoder-decoder design pattern

However, this straightforward approach encounters a significant drawback: attending over every pixel produces an attention map whose size grows quadratically with the number of pixels, i.e., $O((H \cdot W)^2)$. The computation and memory costs associated with such large attention maps pose a severe limitation.
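A quick back-of-the-envelope calculation shows why this is prohibitive. The sketch below counts attention-map entries for per-pixel attention; the 32×32 and 224×224 image sizes are illustrative assumptions, not values from the lesson.

```python
def attention_map_entries(h, w):
    # One attention weight per (query, key) pair over all h*w pixel tokens,
    # so the map has (h*w)**2 entries
    tokens = h * w
    return tokens * tokens

# Illustrative image sizes (assumed)
small = attention_map_entries(32, 32)    # 1024 tokens -> ~1e6 entries
large = attention_map_entries(224, 224)  # 50176 tokens -> ~2.5e9 entries
print(small, large)

# Stored as float32 (4 bytes), one 224x224 map needs ~10 GB per attention head
print(large * 4 / 1e9)
```

Even a modest 224×224 image yields roughly 2.5 billion attention weights per head, which is exactly the cost that motivates ViT's patch-based tokenization.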

Another challenge in replacing the encoder with an attention mechanism in image classification involves capturing local spatial information effectively. Convolutional layers in traditional architectures like CNNs are inherently designed to capture local features and spatial hierarchies. However, when using transformers ...
