Convolutional Encoders

Let's refresh our understanding of how convolution works in computer vision encoders.

Understanding convolution encoders

The core concept of a convolutional encoder, also known as a backbone or feature extractor, is to extract local and translation-invariant features. These features are local in the sense that each one depends only on a specific region of the image, defined by the kernel's receptive field. They are also translation-invariant, meaning the same kernel can identify the same feature even if it shifts within the image; strictly speaking, the convolution output shifts along with the feature, and pooling is what makes the response largely insensitive to small shifts. This detection relies on the interaction between the kernel weights and the image pixels or, in deeper layers, the feature map.
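To make the two properties concrete, here is a minimal pure-Python sketch. The helper names (`conv2d_valid`, `place_patch`, `argmax2d`), the 6×6 image, and the 2×2 diagonal pattern are all illustrative assumptions, not part of any library. The sketch slides one kernel over two images that contain the same pattern at different positions and compares where, and how strongly, the kernel responds.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (stride 1, no padding) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Each output value depends only on the kh x kw window under the kernel (locality).
            sum(image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def place_patch(size, patch, top, left):
    """Return a size x size image of zeros with `patch` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for di, row in enumerate(patch):
        for dj, value in enumerate(row):
            image[top + di][left + dj] = value
    return image


def argmax2d(feature_map):
    """Return ((row, col), value) of the first strongest response in the map."""
    best = max(max(row) for row in feature_map)
    for i, row in enumerate(feature_map):
        for j, value in enumerate(row):
            if value == best:
                return (i, j), best


# Hypothetical feature (a small diagonal) and a kernel that responds to it.
patch = [[1, 0], [0, 1]]
kernel = [[1, -1], [-1, 1]]

image_a = place_patch(6, patch, top=1, left=1)
image_b = place_patch(6, patch, top=3, left=2)  # same feature, shifted

pos_a, peak_a = argmax2d(conv2d_valid(image_a, kernel))
pos_b, peak_b = argmax2d(conv2d_valid(image_b, kernel))

print(pos_a, peak_a)
print(pos_b, peak_b)
```

The peak response has the same strength in both images, and its location follows the pattern. Taking a maximum over the response map, which is what a pooling step does over its window, discards the location and keeps only the fact that the feature was found.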

Feature extraction process

For instance, if we begin with an image that is, let's say, 448 × 448 pixels, we pass it through a series of convolutional layers. Each of these layers typically combines convolution and max-pooling operations, gradually reducing the feature map size.
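As a rough illustration of this shrinking, the sketch below traces the spatial size of the feature map through a few stages. The layer settings are assumptions chosen for the example, not taken from a specific architecture: each stage is a 3×3 convolution with stride 1 and padding 1, followed by a 2×2 max-pool with stride 2. The functions `out_size` and `trace_sizes` are hypothetical helpers that apply the standard output-size formula floor((n + 2p - k) / s) + 1.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Output length along one spatial dimension for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_sizes(size, num_stages):
    """Return the feature map side length before and after each conv + max-pool stage."""
    sizes = [size]
    for _ in range(num_stages):
        # Assumed 3x3 convolution, stride 1, padding 1: keeps the spatial size.
        size = out_size(size, kernel=3, stride=1, padding=1)
        # Assumed 2x2 max-pool, stride 2: halves the spatial size.
        size = out_size(size, kernel=2, stride=2)
        sizes.append(size)
    return sizes


sizes = trace_sizes(448, num_stages=5)
print(sizes)  # [448, 224, 112, 56, 28, 14]
```

Under these assumptions, five stages take the 448 × 448 input down to a 14 × 14 feature map. Different kernel sizes, strides, or padding would give a different sequence.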
