Encoder-Decoder Design Pattern
Learn about the fundamental concepts of computer vision, including a core building block: the encoder-decoder.
Before we dive in, let's recap what we aim to achieve in computer vision (CV).
What do we want from computer vision?
In general, CV tasks fall into two categories: image analysis and image synthesis. This course primarily focuses on the analysis aspect.
Later, we'll study the distinctions between various CV tasks, such as image classification, object detection (whether of a single object or multiple objects), and the different types of segmentation: instance, semantic, and panoptic.
As we progress, we'll explore how transformers handle each of these tasks and discuss the current state of transformer models for each.
Our objective is to observe how all these tasks share a common model design pattern. This pattern includes a feature extractor or backbone followed by a decision layer that produces an output.
For instance, in multiclass classification or face recognition, a stack of layers, convolutional or fully connected, encodes the input into features, and the weights of these layers adapt to the task during training. A softmax layer at the output is then responsible for identifying the person in the image. Each class score is a trainable linear combination of the feature block's output; the softmax converts these scores into probabilities, from which the final prediction is taken.
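The classification pipeline above can be sketched as follows. This is a minimal illustration, not a real trained model: the encoder's output is faked with random features, and the layer sizes (512 features, 10 classes) are arbitrary assumptions.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to probabilities.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-in for the encoder's output: a 512-dim feature vector for one image.
features = rng.standard_normal((1, 512))

# Decision layer: a trainable linear combination of the features,
# producing one score per class (here, 10 hypothetical classes).
W = rng.standard_normal((512, 10)) * 0.01
b = np.zeros(10)

logits = features @ W + b          # class scores
probs = softmax(logits)            # class probabilities
predicted_class = probs.argmax()   # final decision
```

In a real model, `W` and `b` (together with the encoder weights) would be learned by backpropagation; the structure, features in, scores out, probabilities via softmax, stays the same.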
This encoder-decoder design pattern is also applicable beyond classification. It can be employed in various CV tasks. In machine learning (ML), the features can either be handcrafted or learnable, as we know from the basics of ML and deep learning (DL).
The universal encoder-decoder architecture
This design pattern is universal and serves as a master architecture in DL, not limited to classification but extending to other CV tasks. For instance, in object detection, we can have multiple decoders for different aspects, such as object class and bounding box coordinates. For segmentation, we might use a decoder that mirrors the encoder, as seen in the UNet architecture: convolutions in the encoder, deconvolutions (transposed convolutions) in the decoder, and input and output of the same size.
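The object-detection variant, one shared encoder feeding several decoders, can be sketched like this. All the shapes and projections here are illustrative assumptions (a flatten-and-project stands in for a convolutional backbone), chosen only to show how two heads branch off the same features.

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(image):
    # Hypothetical stand-in for a convolutional backbone:
    # flatten the image and project it to a 128-dim feature vector.
    W_enc = rng.standard_normal((image.size, 128)) * 0.01
    return image.reshape(-1) @ W_enc

def class_head(features, num_classes=20):
    # Decoder 1: what is the object? One logit per class.
    W = rng.standard_normal((features.size, num_classes)) * 0.01
    return features @ W

def box_head(features):
    # Decoder 2: where is the object? Regress 4 box coordinates.
    W = rng.standard_normal((features.size, 4)) * 0.01
    return features @ W

image = rng.standard_normal((32, 32, 3))
f = encoder(image)                 # shared representation
class_logits = class_head(f)       # head for the object class
box = box_head(f)                  # head for (x, y, w, h)
```

The key design point is that both heads read the same feature vector `f`: the expensive feature extraction is done once, and each task gets its own lightweight decoder.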
This encoder-decoder design pattern is not exclusive to CV. We find it in neural machine translation as well, with an encoder that digests the input sequence and a decoder that generates the output based on the encoder's information. This same concept applies to transformer models.
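The translation case follows the same shape, and can be caricatured in a few lines. This toy sketch uses a mean over random embeddings as the "encoder" and a greedy one-step predictor as the "decoder"; the vocabulary size, dimensions, and token ids are all made up, and nothing here is trained, the point is only the division of labor between the two stages.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim = 50, 16
embed = rng.standard_normal((vocab_size, dim)) * 0.1

def encode(src_tokens):
    # Encoder: digest the whole source sequence into one context vector
    # (a mean over embeddings stands in for an RNN or transformer encoder).
    return embed[src_tokens].mean(axis=0)

def decode_step(context, prev_token, W_out):
    # Decoder: predict the next token from the encoder's context
    # plus the previously generated token.
    state = context + embed[prev_token]
    return int((state @ W_out).argmax())

W_out = rng.standard_normal((dim, vocab_size)) * 0.1
src = np.array([3, 7, 12])         # arbitrary source token ids
context = encode(src)

# Greedy generation of 4 target tokens, starting from a start token (id 0).
out, tok = [], 0
for _ in range(4):
    tok = decode_step(context, tok, W_out)
    out.append(tok)
```

A real translation model replaces both stages with learned networks, but the contract is identical: the encoder compresses the input, and the decoder generates output conditioned on that compressed representation.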
Key idea: Encoder-decoder architecture
To generalize, the encoder-decoder architecture involves two main stages: the encoder, which compresses the input into a feature representation, and the decoder, which maps those features to the desired output.