Convolutional Encoders

Get introduced to the world of convolutional encoders and explore their crucial role in feature extraction for computer vision.

Let's refresh our understanding of how convolution works in computer vision encoders.

Understanding convolutional encoders

The core concept of a convolutional encoder, also known as a backbone or feature extractor, is to extract local, translation-invariant features. These features are local in the sense that each depends only on a specific region of the image, defined by the kernel's scope. They are also translation-invariant, meaning the same feature is detected even if it shifts within the image. This detection arises from the interaction between the kernel weights and the underlying pixels or feature-map values.
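To make translation invariance concrete, here is a minimal sketch (PyTorch is assumed here; the lesson does not prescribe a framework). The same 3×3 kernel detects a vertical edge wherever it appears, and its response simply shifts along with the edge:

```python
# Minimal sketch: a convolution's response to a pattern shifts when the
# pattern shifts, i.e. the same feature is detected at a new location.
# (PyTorch is an assumption; the lesson does not name a framework.)
import torch
import torch.nn.functional as F

# A 3x3 kernel acting as a vertical-edge detector.
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-1., 0., 1.],
                         [-1., 0., 1.]]]])

# Two 8x8 images containing the same vertical edge at different positions.
img_a = torch.zeros(1, 1, 8, 8)
img_a[..., :, 2:] = 1.0          # edge starting at column 2
img_b = torch.zeros(1, 1, 8, 8)
img_b[..., :, 5:] = 1.0          # same edge, shifted right by 3

resp_a = F.conv2d(img_a, kernel, padding=1)
resp_b = F.conv2d(img_b, kernel, padding=1)

# The strongest response moves with the edge: same feature, new location.
print(resp_a[0, 0, 4].argmax().item())  # column of peak response in img_a: 1
print(resp_b[0, 0, 4].argmax().item())  # shifted by 3 in img_b: 4
```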

Feature extraction process

For instance, suppose we begin with an image of, say, 448×448 pixels. We pass it through a series of convolutional layers, each typically combining convolution and max-pooling operations, gradually reducing the feature-map size.
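As a rough sketch of such an encoder (again assuming PyTorch; the channel counts are illustrative and not taken from the lesson), six convolution-plus-pooling stages each halve the spatial size, taking a 448×448 input down to 7×7:

```python
# Illustrative encoder stack: each stage is a 3x3 convolution followed by
# 2x2 max pooling, halving height and width. Six stages: 448 -> 7.
# (PyTorch assumed; channel progression is hypothetical.)
import torch
import torch.nn as nn

stages = []
channels = [3, 16, 32, 64, 128, 256, 512]  # hypothetical channel counts
for c_in, c_out in zip(channels[:-1], channels[1:]):
    stages += [
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2),  # halves the spatial size
    ]
encoder = nn.Sequential(*stages)

x = torch.randn(1, 3, 448, 448)  # one RGB input image
print(encoder(x).shape)          # torch.Size([1, 512, 7, 7])
```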

Figure: Encoding an image

The application of a 3×3 kernel, followed by pooling, contributes to this reduction. The final abstracted feature map might be, for example, 7×7 in size. This 7×7 feature map contains 49 numbers, each representing a region of the input image. To visualize this, think of each number as corresponding to a square area in the original image.
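A quick back-of-the-envelope check of this grid picture: with 448 input pixels divided across 7 cells, each cell stands for a 64×64 square of the original image. (Real receptive fields overlap and extend further, but the grid view is a useful approximation.) The helper below, `cell_to_region`, is a hypothetical illustration, not part of any library:

```python
# Grid-view approximation: map each of the 49 feature-map cells to the
# square of the 448x448 input image it represents.
input_size, map_size = 448, 7
cell = input_size // map_size  # 448 / 7 = 64 pixels per cell

def cell_to_region(row, col):
    """Square of the input image that feature-map cell (row, col) covers."""
    return (row * cell, col * cell, (row + 1) * cell, (col + 1) * cell)

print(cell_to_region(0, 0))  # (0, 0, 64, 64): top-left corner of the image
print(cell_to_region(3, 3))  # (192, 192, 256, 256): centre of the image
```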

Zooming out: Abstracting more information

The key point to understand is that we are not limited to detecting only one feature. The final feature map might be, say, 2×2 ...