CNN Architectures

Explore the most innovative CNN architectures (available from 2012 until 2021) and their principles.

We'll cover the following...

In this section, we will discuss CNN architectures that stood the test of time. Even though not all of them are still used in recent top-performing architectures, it is important to study them and understand their intuitions.

AlexNet

AlexNet is made up of 5 conv layers starting from an 11x11 kernel. It was the first architecture that employed max-pooling layers, Relu activation functions, and dropout for the 3 enormous linear layers. The network was used for image classification with 1000 possible classes, which for that time was madness (it was introduced in 2012). Now you can implement it in 35 lines of Pytorch code

It was the first convolutional model that was successfully trained on Imagenet, a dataset with 1M training images of 1000 classes.

Press + to interact
import torch
import torch.nn as nn
class AlexNet(nn.Module):
def __init__(self, num_classes: int = 1000) -> None:
super(AlexNet, self).__init__()
self.features = nn.Sequential(
nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.Conv2d(64, 192, kernel_size=5, padding=2),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
nn.Conv2d(192, 384, kernel_size=3, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(384, 256, kernel_size=3, padding=1),
nn.ReLU(inplace=True),
nn.Conv2d(256, 256, kernel_size=3, padding=1),
nn.ReLU(inplace=True),
nn.MaxPool2d(kernel_size=3, stride=2),
)
self.avgpool = nn.AdaptiveAvgPool2d((6, 6))
self.classifier = nn.Sequential(
nn.Dropout(),
nn.Linear(256 * 6 * 6, 4096),
nn.ReLU(inplace=True),
nn.Dropout(),
nn.Linear(4096, 4096),
nn.ReLU(inplace=True),
nn.Linear(4096, num_classes),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = self.features(x)
x = self.avgpool(x)
x = torch.flatten(x, 1)
x = self.classifier(x)
return x
model = AlexNet(num_classes=10)
inp = torch.rand(1,3,128,128)
print(model(inp).shape)

VGG

The famous paper “Very Deep Convolutional Networks for Large-Scale Image Recognition” made the term “deep” viral. It was the first study that provided undeniable evidence that simply adding more layers increases performance. Nonetheless, this assumption holds true up to a certain point. The authors used only 3x3 kernels, as opposed to AlexNet. The architecture was trained using 224 × 224 RGB images.

The main principle is that a stack of three 3×33 \times 3 conv layers are similar to a single 7×77 \times 7 layer. And maybe even better because they use three non-linear activations in between (instead of one), which makes the function more discriminative.

Secondly, this design decreases the number of parameters. Specifically, you need 3(32)C2=27×C23*(3^2)C^2 = 27 \times C^2 ...