Let's explore the use of transformers in the context of image classification. We'll discuss two significant architectures: the vision transformer (ViT), and the data-efficient image transformer (DeiT).

Traditional image classification architecture

To understand how transformers change image classification, let's first revisit the conventional approach: the convolutional neural network (CNN). Typically, a CNN-based image classification system is built around a backbone, also known as the encoder or feature extractor. This encoder comprises multiple convolution layers followed by max pooling, where the max pooling operation often halves the height and width of the feature map.
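To make the halving effect concrete, here is a minimal sketch of max pooling in plain Python. It assumes a 2x2 window with a stride of 2, which is a common choice but not the only one. The function name `max_pool_2x2` and the sample values are made up for illustration and do not come from any library.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to a 2D feature map (a list of rows)."""
    height = len(feature_map)
    width = len(feature_map[0])
    pooled = []
    # Step over the map in non-overlapping 2x2 windows.
    for row in range(0, height - 1, 2):
        pooled_row = []
        for col in range(0, width - 1, 2):
            window = (
                feature_map[row][col],
                feature_map[row][col + 1],
                feature_map[row + 1][col],
                feature_map[row + 1][col + 1],
            )
            # Keep only the strongest activation in each window.
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


# Illustrative 4x4 feature map (values chosen arbitrarily).
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 5, 1],
    [7, 2, 9, 8],
    [0, 1, 3, 4],
]

pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 5], [7, 9]]
```

The 4x4 input becomes a 2x2 output: each spatial dimension is halved, so the number of values drops to a quarter. In a real CNN, a framework layer would apply the same operation to every channel of the feature map.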
