Convolution in Practice

Find out why convolutional and pooling layers are the building blocks of Convolutional Neural Networks.


When it comes to real-life applications, most images are in fact 3D tensors with width, height, and 3 channels (R, G, B) as dimensions.

In that case, the kernel should also be a 3D tensor (k×k×channels). Each kernel still produces a single 2D feature map: the sliding happens only across width and height, and at each position we take the dot product over all input channels at once. Each kernel therefore produces 1 output channel.

In practice, we tend to use more than 1 kernel in order to capture different kinds of features at the same time.
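To make the shapes concrete, here is a minimal sketch (with an illustrative choice of 4 kernels on a 7x7 RGB image) of the sliding dot product described above, written out with explicit loops:

```python
import torch

# a single RGB image: (channels, height, width)
image = torch.randn(3, 7, 7)

# 4 kernels, each spanning all 3 input channels: (num_kernels, channels, k, k)
kernels = torch.randn(4, 3, 3, 3)

# slide each kernel across width and height only; at every position the
# dot product runs over all 3 input channels at once
out_h = image.shape[1] - kernels.shape[2] + 1  # 5
out_w = image.shape[2] - kernels.shape[3] + 1  # 5
feature_maps = torch.empty(4, out_h, out_w)
for n in range(4):                      # one 2D feature map per kernel
    for i in range(out_h):
        for j in range(out_w):
            patch = image[:, i:i + 3, j:j + 3]       # (3, 3, 3) patch
            feature_maps[n, i, j] = (patch * kernels[n]).sum()

print(feature_maps.shape)  # torch.Size([4, 5, 5]): 4 kernels -> 4 channels
```

The loop version is only for intuition; in practice the framework performs the same computation in a single optimized call.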

A convolutional layer

As you may have guessed, our learnable weights are now the values of our filters and can be trained with backpropagation, as usual. We can add a bias into each filter as well.

Convolutional layers can be stacked on top of each other. Since convolutions are linear operators, we include non-linear activation functions in between, just as we did with fully connected layers.
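A quick sketch of such a stack (the channel counts here are illustrative): a ReLU between the convolutions keeps the composition non-linear, since two back-to-back convolutions without it would collapse into a single linear operator.

```python
import torch
import torch.nn as nn

# two stacked convolutional layers with non-linearities in between
model = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3),
    nn.ReLU(),
    nn.Conv2d(in_channels=8, out_channels=16, kernel_size=3),
    nn.ReLU(),
)

x = torch.randn(1, 3, 32, 32)  # a batch of one 32x32 RGB image
print(model(x).shape)          # torch.Size([1, 16, 28, 28])
```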

To recap, you have to think in terms of input channels, output channels, and kernel size. And that is exactly how we are going to define it in PyTorch.

To define a convolutional layer in PyTorch, we write:

import torch.nn as nn
conv_layer = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)
print(conv_layer)

The above layer will receive an image with 3 channels (i.e., R, G, B) and will output 5 feature maps (channels) using a kernel of size 5x5x3. For simplicity, we just call it a 5x5 kernel.
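The learnable weights and the per-filter bias mentioned earlier are ordinary trainable parameters of this layer, and we can inspect their shapes directly:

```python
import torch.nn as nn

conv_layer = nn.Conv2d(in_channels=3, out_channels=5, kernel_size=5)

# one 5x5x3 kernel per output channel, stored as (out, in, k, k)
print(conv_layer.weight.shape)  # torch.Size([5, 3, 5, 5])
# one bias term per output channel
print(conv_layer.bias.shape)    # torch.Size([5])
```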

Spatial dimensions

Because figuring out the dimensions of each tensor can quickly become complicated, let’s examine a simple example of a convolutional layer. Assume that we have a 7x7 image and a 3x3 filter. The feature map will be of size 5x5.
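The arithmetic behind this example can be captured in a small helper (a sketch for the stride-1, no-padding case only): the output size is the input size minus the kernel size, plus 1.

```python
def feature_map_size(input_size: int, kernel_size: int) -> int:
    """Spatial size of the output for stride 1 and no padding."""
    return input_size - kernel_size + 1

print(feature_map_size(7, 3))  # 5, matching the 7x7 image and 3x3 filter
```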

Spatial dimensions of a convolutional layer example

Sliding the kernel one pixel at a time with no padding is the most common approach, but it is not the only one.

Sometimes, we may want to slide/move every 2 pixels instead of every 1, thus introducing an extra hyperparameter called stride.

In some cases, we can also pad the image around the edges with zeros in order to control the output dimensions. The amount of ...
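Both stride and zero-padding can be passed directly to `nn.Conv2d`. A quick sketch (with illustrative values) of how each one changes the output shape on the 7x7 example:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 7, 7)  # one 7x7 RGB image

# stride=2: the kernel moves 2 pixels at a time, shrinking the output
strided = nn.Conv2d(3, 5, kernel_size=3, stride=2)
print(strided(x).shape)  # torch.Size([1, 5, 3, 3])

# padding=1: a 1-pixel zero border preserves the 7x7 spatial size
padded = nn.Conv2d(3, 5, kernel_size=3, padding=1)
print(padded(x).shape)   # torch.Size([1, 5, 7, 7])
```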