ResNet (2015)
Learn the fundamentals of the ResNet image classification architecture, along with the vanishing and exploding gradient problems.
ResNet is the image classification architecture that won the ILSVRC competition in 2015. Its novel addition to the basic CNN structure has been widely adopted in later architectures and various state-of-the-art models.
General structure
- ResNet comes in different versions, such as ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152. The general structure is the same, but the network gets deeper; the number stands for the layer count, so ResNet-18 has 18 layers, ResNet-34 has 34 layers, and so on. ResNet-34 is one of the winners of the competition, so we will review the structure based on it.
- Its training strategy is similar to that of other architectures: the learning rate is initialized to 0.1 and divided by 10 when the error stops improving, SGD is used with a momentum of 0.9, L2 regularization is applied with a coefficient of 0.0001, and the batch size is 256 (a minimal training-setup sketch follows this list).
- Batch normalization is used; dropout is not used.
- The novelty of ResNet is that the architecture consists of the following building blocks.
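The training recipe in the list above maps directly onto a standard optimizer configuration. Here is a minimal sketch using PyTorch and torchvision, which are my choice of tools rather than anything named in the text; the model, loss, and scheduler patience value are placeholders, and the data loading and training loop are only outlined in comments.

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Placeholder model and loss; any ResNet variant would illustrate the same setup.
model = models.resnet34(weights=None)
criterion = nn.CrossEntropyLoss()

# SGD with momentum 0.9, L2 regularization (weight decay) 1e-4, learning rate starting at 0.1.
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Divide the learning rate by 10 whenever the monitored error stops improving
# (the patience value here is a hypothetical choice).
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=5)

# Inside the training loop (batch size 256 would be set on the DataLoader):
#   loss = criterion(model(images), labels)
#   optimizer.zero_grad(); loss.backward(); optimizer.step()
# and once per epoch:
#   scheduler.step(validation_error)
```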
Residual blocks
A residual block adds the block's original input (the identity) to the output of its convolution and activation functions.
Instead of moving forward with standard convolutional layers, where the input of one layer is simply the output of the previous one, here we keep the original input on one side and add it to the convolved version of the same input. The connection carrying the original input is also called an identity connection or a shortcut connection.
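As an illustration, here is a minimal sketch of such a block in PyTorch (an assumption on my part; the text does not prescribe a framework). It keeps the spatial size and channel count unchanged, so the input can be added directly to the convolved output.

```python
import torch.nn as nn
import torch.nn.functional as F

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions whose output is added to the untouched input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x                            # keep the original input
        out = F.relu(self.bn1(self.conv1(x)))   # conv -> batch norm -> activation
        out = self.bn2(self.conv2(out))
        return F.relu(out + identity)           # add the shortcut, then activate
```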
Bottleneck block
While residual blocks keep the feature map dimensions unchanged, another type of residual block is implemented for this architecture.
Bottleneck residual block: First, it squeezes the feature maps by reducing the channel count with a 1x1 convolution, then applies a 3x3 convolution, and finally increases the channel count again (expansion) with another 1x1 convolution so that the output has exactly the same dimensions as the identity shortcut.
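The squeeze-convolve-expand pattern can be sketched the same way (again in PyTorch, with a hypothetical squeeze factor); note that the input and output channel counts match, so the identity can still be added directly.

```python
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    """1x1 squeeze -> 3x3 convolution -> 1x1 expansion, added to the identity."""
    def __init__(self, channels, squeeze_factor=4):
        super().__init__()
        mid = channels // squeeze_factor
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1, bias=False)   # squeeze channels
        self.bn1 = nn.BatchNorm2d(mid)
        self.conv2 = nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(mid)
        self.conv3 = nn.Conv2d(mid, channels, kernel_size=1, bias=False)   # expand back
        self.bn3 = nn.BatchNorm2d(channels)

    def forward(self, x):
        identity = x
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + identity)           # dimensions match, so add directly
```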
The below image shows an 18-layer ResNet architecture.
Note: The solid-line shortcuts represent identity and convolved outputs that are added directly, since both parts have the same dimensions (plain residual blocks). In contrast, the dotted shortcuts indicate that the identity and convolved outputs have different dimensions, so either zero padding or a 1x1 convolution is applied to the shortcut to match them.
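The 1x1-convolution option for a dotted shortcut can be sketched as follows (PyTorch again, with hypothetical channel counts and stride); the shortcut path is convolved only so that its dimensions match the main path before the addition.

```python
import torch.nn as nn
import torch.nn.functional as F

class DownsampleResidualBlock(nn.Module):
    """Residual block whose main path halves the spatial size and changes the
    channel count; a 1x1 convolution on the shortcut matches the dimensions."""
    def __init__(self, in_channels, out_channels, stride=2):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        # The "dotted" shortcut: a 1x1 convolution that matches both the
        # spatial size and the channel count of the main path.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=1,
                      stride=stride, bias=False),
            nn.BatchNorm2d(out_channels),
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.shortcut(x))
```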
Vanishing gradient
It’s time to take a step further and learn about the problems that can lead to undesirable or unsuccessful results while training our model. Besides knowing how to create the network, it’s also essential to understand how to deal with these training issues.
Depending on our goal, shallow neural networks might not be enough to learn our complex tasks, so we need deeper ones. The deeper our neural network is, the harder it is to train it! One of the main reasons for this is the vanishing gradient problem.
We already know that we need gradients to update the weights while moving backward through the network. Starting from the output layer, the further back we go, the smaller the gradient becomes. By the time we reach the initial layers, the gradient has nearly vanished, and since it is the gradient that updates the weights, those weights barely change and the model doesn’t learn. Vanishing gradients are also sometimes called dead gradients.
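To make the effect concrete, here is a small numeric sketch in plain Python/NumPy (the shared weight, the layer count, and the incoming gradient are all hypothetical values, not figures from the text); it shows the gradient shrinking as it is multiplied by the sigmoid derivative at each layer on the way back.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

w = 0.5                       # hypothetical shared weight for every layer
a = 1.0                       # input signal
activations = []
for _ in range(10):           # forward pass through 10 sigmoid layers
    a = sigmoid(w * a)
    activations.append(a)

# Backward pass: each layer multiplies the incoming gradient by
# w * sigmoid'(z), and sigmoid'(z) is at most 0.25, so the product
# shrinks roughly geometrically the further back we go.
grad = 1.0                    # gradient arriving from the loss
for a in reversed(activations):
    grad *= w * a * (1 - a)   # sigmoid'(z) = a * (1 - a)
    print(f"gradient one more layer back: {grad:.2e}")
```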
The following neural networks show what happens when carrying the signal in the forward pass using sigmoid activation functions and calculating the derivative repeatedly in the backward pass.
Note: The cost function was kept simple to make the calculations easier to follow, and the true answer is taken to be 1. We don’t need to pay too much attention to the calculations themselves, but we should note the shrinking gradient. The further we move backward, the smaller the gradient gets, and this is only a simple three-layer network; imagine how severe the vanishing gradient problem becomes in deep neural networks.
Here,