ResNet (2015)

Learn the fundamentals of the ResNet image classification architecture and the vanishing and exploding gradient problems.

ResNet is the image classification architecture that won the ILSVRC competition in 2015. This model’s novel take on the basic CNN structure is widely used in later architectures and various state-of-the-art models.

General structure

  • ResNet has different versions, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152, where the general structure is the same but the network is deeper. The number stands for the layer count, so ResNet-18 has 18 layers, ResNet-34 has 34 layers, and so on. ResNet-34 is one of the winning models of the competition, so we will review the structure based on it.

  • Its training strategies are similar to those of other architectures. The learning rate is initialized to 0.1 and divided by 10 when the error stops improving, the optimizer is SGD with a momentum of 0.9, L2 regularization is applied with a coefficient of 0.0001, and the batch size is 256 (see the sketch after this list).

  • Batch normalization is used; dropout is not used.

  • The novelty of ResNet lies in the building blocks described in the following sections.
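Before moving on to those blocks, here is a minimal sketch of the training setup listed above, assuming PyTorch; the placeholder model, the scheduler patience, and the `val_loss` name are illustrative assumptions, not the original training code:

```python
import torch
from torch import nn, optim

model = nn.Sequential(nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3))  # placeholder model

# SGD with momentum 0.9 and L2 regularization (weight decay) of 0.0001, learning rate 0.1
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Divide the learning rate by 10 when the validation error stops improving
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)

# Inside the training loop (batch size 256), after each validation pass:
# scheduler.step(val_loss)
```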

Residual blocks

A residual block adds the unchanged (identity) input to the output of the convolution and activation functions.

Figure: Residual block, a combination of the identity and main blocks

Instead of moving forward with standard convolutional layers, where the input of one layer is directly the output of the previous one, we keep the original input on one side and add it to the convolved version of the same input. The connection carrying the original input is also called an identity block or a shortcut connection.

Figure: A deeper look at the residual block
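As a rough sketch of this idea, assuming PyTorch and illustrative layer sizes (not the exact configuration from the paper), a basic residual block could look like this:

```python
import torch
from torch import nn

class ResidualBlock(nn.Module):
    """Basic residual block: output = ReLU(F(x) + x), where F is two 3x3 convolutions."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                              # shortcut branch keeps the original input
        out = self.relu(self.bn1(self.conv1(x)))  # main branch: conv -> BN -> ReLU
        out = self.bn2(self.conv2(out))           # conv -> BN
        out = out + identity                      # add the untouched input back in
        return self.relu(out)

x = torch.randn(1, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # same shape as the input: torch.Size([1, 64, 56, 56])
```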

Bottleneck block

While residual blocks keep the feature map dimensions unchanged, the architecture also implements another type of residual block.

Bottleneck residual block: First, it squeezes the feature maps by decreasing the channel size with a 1×1 convolution, then applies an n×n convolution, and finally increases the channel size again (expansion) with another 1×1 convolution so that the output dimensions exactly match those of the identity block.

Figure: Bottleneck residual block
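Under the same assumptions (PyTorch, illustrative channel sizes, and a 3×3 convolution standing in for the n×n one), a bottleneck block could be sketched as:

```python
import torch
from torch import nn

class BottleneckBlock(nn.Module):
    """Bottleneck block: a 1x1 conv squeezes the channels, a 3x3 conv processes them,
    and a final 1x1 conv expands them back so the output matches the identity branch."""

    def __init__(self, channels: int, squeezed: int):
        super().__init__()
        self.squeeze = nn.Conv2d(channels, squeezed, kernel_size=1, bias=False)  # reduce channels
        self.bn1 = nn.BatchNorm2d(squeezed)
        self.conv = nn.Conv2d(squeezed, squeezed, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(squeezed)
        self.expand = nn.Conv2d(squeezed, channels, kernel_size=1, bias=False)   # restore channels
        self.bn3 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.squeeze(x)))
        out = self.relu(self.bn2(self.conv(out)))
        out = self.bn3(self.expand(out))
        return self.relu(out + x)                 # dimensions match the identity branch again

x = torch.randn(1, 256, 28, 28)
print(BottleneckBlock(256, 64)(x).shape)  # torch.Size([1, 256, 28, 28])
```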

The image below shows an 18-layer ResNet architecture.

Note: The solid shortcuts indicate that the identity and convolved outputs have the same dimensions, so they are added directly (residual blocks). In contrast, the dotted shortcuts indicate that the identity and convolved outputs have different sizes, so either zero padding or a 1×1 convolution is applied to match the dimensions.

Figure: ResNet-18 architecture
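To make the dotted-shortcut case concrete, here is a small sketch (again assuming PyTorch, with illustrative sizes) of a 1×1 projection that resizes the identity branch so it can be added to a convolved output of different dimensions:

```python
import torch
from torch import nn

# Identity branch: 64 channels, 56x56 feature maps
x = torch.randn(1, 64, 56, 56)

# Main branch halves the spatial size and doubles the channels (stride-2 3x3 convolution)
main = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1, bias=False)

# Dotted shortcut: a 1x1 convolution with the same stride so that both branches
# end up with the same shape and can be summed
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

out = main(x) + projection(x)
print(out.shape)  # torch.Size([1, 128, 28, 28])
```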

Vanishing gradient

It’s time to take a step further and learn about the problems that can lead to undesirable or unsuccessful results while training our model. Besides knowing how to create the network, it’s also essential to understand how to deal with our training issues.

Depending on our goal, shallow neural networks might not be enough to learn complex tasks, so we need deeper ones. The deeper our neural network is, the harder it is to train! One of the main reasons for this is the vanishing gradient problem.

We already know that we need the gradients to update our weights while moving backward through the network. Starting from the output layer, the further back we propagate, the smaller the gradient we obtain. By the time we arrive at the initial layers, the gradient has almost disappeared, and since it’s the gradient that should update our weights, the weights barely change and the model doesn’t learn. Vanishing gradients are also known as dead gradients.
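As a minimal numeric sketch of this effect (plain Python with made-up weights and activations, purely for illustration), repeatedly multiplying by the sigmoid derivative, which is at most 0.25, shrinks the gradient layer by layer:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # maximum value is 0.25, reached at z = 0

pre_activations = [0.5] * 10          # toy pre-activation values for 10 layers (made-up)
weight = 0.8                          # illustrative weight value shared by all layers

gradient = 1.0                        # gradient arriving from the loss at the output
for layer, z in enumerate(reversed(pre_activations), start=1):
    gradient *= sigmoid_derivative(z) * weight
    print(f"{layer} layer(s) back: gradient = {gradient:.2e}")

# The gradient shrinks by roughly a factor of 5 per layer and is tiny after 10 layers.
```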

The following neural networks show what happens when we carry the signal through sigmoid activation functions in the forward pass and calculate the derivative repeatedly in the backward pass.

Note: The cost function was kept simple to make the calculations more understandable. Also, the true answer is considered to be 1. We don’t have to pay too much attention to the calculations; the point is the decreasing gradient. The further we move backward, the smaller our gradient becomes, and this is with only three simple layers; imagine how big the vanishing gradient problem is for deep neural networks.

Figure: Forward pass with the sigmoid function

Here, a is the input signal, w0 is the weight, b0 is the bias from the first to the second neuron, and the activation function ...
