Encoder-Decoder Design Pattern
Learn about the fundamental concepts of computer vision, including a core building block: the encoder-decoder.
Before we dive in, let's recap what we aim to achieve in computer vision (CV).
What do we want from computer vision?
In general, CV tasks fall into two categories: image analysis and image synthesis. This course primarily focuses on the analysis aspect.
Later, we'll study the distinctions between various CV tasks, such as image classification, object detection (whether of a single object or multiple objects), and the different types of segmentation: instance, semantic, and panoptic.
As we progress, we'll explore how transformers handle each of these tasks and discuss the current state of transformer models for each.
Our objective is to observe how all these tasks share a common model design pattern. This pattern includes a feature extractor or backbone followed by a decision layer that produces an output.
For instance, in multiclass classification or face recognition, a stack of layers, convolutional or fully connected, encodes the input into features, and the weights of these layers adapt to the task during training. A softmax layer at the output is then responsible for identifying the person in the image. Each class score is a trainable linear combination of the feature block's output; the softmax converts these scores into probabilities, from which the final prediction is taken.
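The classification pipeline above can be sketched as follows. This is a minimal illustration, not a real trained model: the encoder's output is faked with random features, and the layer sizes (512 features, 10 classes) are arbitrary assumptions.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability, then normalize to probabilities.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)

# Stand-in for the encoder's output: a 512-dim feature vector for one image.
features = rng.standard_normal((1, 512))

# Decision layer: a trainable linear combination of the features,
# producing one score per class (here, 10 hypothetical classes).
W = rng.standard_normal((512, 10)) * 0.01
b = np.zeros(10)

logits = features @ W + b          # class scores
probs = softmax(logits)            # class probabilities
predicted_class = probs.argmax()   # final decision
```

In a real model, `W` and `b` (together with the encoder weights) would be learned by backpropagation; the structure, features in, scores out, probabilities via softmax, stays the same.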
This encoder-decoder design pattern is also applicable beyond classification. It can be employed in various CV tasks. In machine learning (ML), the features can either be handcrafted or learnable, as we know from the basics of ML and deep learning (DL).
The universal encoder-decoder architecture
This design pattern is universal and serves as a master architecture in DL, not limited to classification but extending to other CV tasks. For instance, in object detection, we can have multiple decoders for different aspects, such as object class and bounding box coordinates. For segmentation, we might use a decoder that mirrors the encoder, as seen in the UNet architecture: convolutions in the encoder, deconvolutions (transposed convolutions) in the decoder, and input and output of the same size.
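The object-detection variant, one shared encoder feeding several decoders, can be sketched like this. All the shapes and projections here are illustrative assumptions (a flatten-and-project stands in for a convolutional backbone), chosen only to show how two heads branch off the same features.

```python
import numpy as np

rng = np.random.default_rng(1)

def encoder(image):
    # Hypothetical stand-in for a convolutional backbone:
    # flatten the image and project it to a 128-dim feature vector.
    W_enc = rng.standard_normal((image.size, 128)) * 0.01
    return image.reshape(-1) @ W_enc

def class_head(features, num_classes=20):
    # Decoder 1: what is the object? One logit per class.
    W = rng.standard_normal((features.size, num_classes)) * 0.01
    return features @ W

def box_head(features):
    # Decoder 2: where is the object? Regress 4 box coordinates.
    W = rng.standard_normal((features.size, 4)) * 0.01
    return features @ W

image = rng.standard_normal((32, 32, 3))
f = encoder(image)                 # shared representation
class_logits = class_head(f)       # head for the object class
box = box_head(f)                  # head for (x, y, w, h)
```

The key design point is that both heads read the same feature vector `f`: the expensive feature extraction is done once, and each task gets its own lightweight decoder.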
This encoder-decoder design pattern is not exclusive to CV. We find it in neural machine translation as well, with an encoder that digests the input sequence and a decoder that generates the output based on the encoder's information. This same concept applies to transformer models.
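The translation case follows the same shape, and can be caricatured in a few lines. This toy sketch uses a mean over random embeddings as the "encoder" and a greedy one-step predictor as the "decoder"; the vocabulary size, dimensions, and token ids are all made up, and nothing here is trained, the point is only the division of labor between the two stages.

```python
import numpy as np

rng = np.random.default_rng(2)
vocab_size, dim = 50, 16
embed = rng.standard_normal((vocab_size, dim)) * 0.1

def encode(src_tokens):
    # Encoder: digest the whole source sequence into one context vector
    # (a mean over embeddings stands in for an RNN or transformer encoder).
    return embed[src_tokens].mean(axis=0)

def decode_step(context, prev_token, W_out):
    # Decoder: predict the next token from the encoder's context
    # plus the previously generated token.
    state = context + embed[prev_token]
    return int((state @ W_out).argmax())

W_out = rng.standard_normal((dim, vocab_size)) * 0.1
src = np.array([3, 7, 12])         # arbitrary source token ids
context = encode(src)

# Greedy generation of 4 target tokens, starting from a start token (id 0).
out, tok = [], 0
for _ in range(4):
    tok = decode_step(context, tok, W_out)
    out.append(tok)
```

A real translation model replaces both stages with learned networks, but the contract is identical: the encoder compresses the input, and the decoder generates output conditioned on that compressed representation.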
Key idea: Encoder-decoder architecture
To generalize, the encoder-decoder architecture involves two main stages: the encoder, which compresses the input into a feature representation, and the decoder, which maps those features to the desired output.