InceptionV1 (GoogLeNet, 2014)

Learn the fundamentals of the InceptionV1 (also called GoogLeNet) image classification architecture, along with the network-in-network method.

General structure

InceptionV1 is the image classification architecture that won the ILSVRC competition in 2014.

  • It has a 22-layer architecture that applies the network-in-network approach in some layers, which the authors call Inception modules.

  • Its training strategies are similar to those of other architectures: SGD with a momentum of 0.9, a fixed learning rate schedule that decreases the rate by 4% every 8 epochs, dropout at the fully connected layers with a rate of 0.4, the ReLU activation function in the Inception modules, and softmax at the end (see the training sketch after this list).

  • Average pooling is applied between the final convolutional layer and the fully connected layers.

  • Instead of having one fully connected head, the network has three. The two additional extensions are called auxiliary classifiers. The exciting part is that all three heads are used during training: the losses of the auxiliary classifiers are added to the total loss with a weight of 0.3, which strengthens the gradient signal reaching the earlier layers. At inference time, the auxiliary heads are discarded, and only the main classifier is used.
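To make the training recipe concrete, here is a minimal PyTorch sketch of this setup. The framework choice, the initial learning rate of 0.01, and the dummy tensors are assumptions for illustration; the momentum of 0.9, the 4% decay every 8 epochs, and the 0.3 auxiliary-loss weight come from the paper.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# A stand-in parameter set; in practice this would be the full GoogLeNet.
model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)

# SGD with a momentum of 0.9, as in the paper; the initial learning
# rate of 0.01 is an assumption for this sketch.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Fixed schedule: decrease the learning rate by 4% every 8 epochs,
# i.e., multiply it by 0.96. Call scheduler.step() once per epoch.
scheduler = StepLR(optimizer, step_size=8, gamma=0.96)

criterion = nn.CrossEntropyLoss()

# During training, the two auxiliary heads contribute to the total loss
# with a weight of 0.3 each. The logits below are hypothetical outputs
# of the three classifier heads for a batch of 8 images.
main_logits = torch.randn(8, 1000)
aux1_logits = torch.randn(8, 1000)
aux2_logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))

loss = criterion(main_logits, targets) + 0.3 * (
    criterion(aux1_logits, targets) + criterion(aux2_logits, targets)
)
```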

Network-in-network

The main logic of network-in-network layers is to apply convolutions of different sizes to the same input and concatenate the resulting feature maps to obtain the final output of a single layer. This approach produces feature maps at different scales from the same input and increases the variety of information extracted from the image, thereby widening the learning capacity of the model.

Following this logic, a network-in-network layer can be built with any combination of convolution filters. The authors call the special layers that use this approach Inception modules. The structure is as follows:

Inception module: naive version
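As a concrete illustration, here is a minimal PyTorch sketch of the naive Inception module: parallel 1×1, 3×3, and 5×5 convolutions plus a 3×3 max pooling, all applied to the same input and concatenated along the channel dimension. The framework choice and the branch channel counts are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class NaiveInceptionModule(nn.Module):
    """Naive Inception module: convolutions of different sizes applied
    in parallel to the same input, concatenated channel-wise."""

    def __init__(self, in_channels, c1, c3, c5):
        super().__init__()
        # Padding keeps the spatial size identical across branches,
        # which is required for the concatenation.
        self.branch1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return torch.cat(
            [
                self.relu(self.branch1(x)),
                self.relu(self.branch3(x)),
                self.relu(self.branch5(x)),
                self.pool(x),  # pooling branch passes channels through unchanged
            ],
            dim=1,  # concatenate the feature maps along the channel axis
        )

# Usage with hypothetical channel counts: output channels are
# 64 + 128 + 32 + 192 (pool branch) = 416.
x = torch.randn(1, 192, 28, 28)
module = NaiveInceptionModule(192, c1=64, c3=128, c5=32)
print(module(x).shape)  # torch.Size([1, 416, 28, 28])
```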

Auxiliary classifiers

Apart from the main classifier head at the end of the model, the authors add two extensions to make predictions from different depths, and therefore different scales, of the network. They call these additional parts auxiliary classifiers. An auxiliary classifier’s structure is as follows:

  • An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the first auxiliary extension and 4×4×528 for the second one.

  • A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation (ReLU).

  • A fully connected layer with 1024 units and rectified linear activation (ReLU), followed by a dropout layer with a rate of 0.7 and a linear layer with softmax as the classifier.
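The following is a minimal PyTorch sketch of this auxiliary head, following the structure listed above. The framework choice and the 1000-class output are assumptions for illustration; the layer sizes come from the bullets above.

```python
import torch
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    """Auxiliary classifier head with the structure listed above."""

    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        # 5x5 average pooling with stride 3: a 14x14 map becomes 4x4.
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)
        # 1x1 convolution with 128 filters for dimension reduction.
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        # Fully connected layer with 1024 units, then dropout.
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(p=0.7)
        # Linear classifier producing the class logits for softmax.
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.pool(x)                      # e.g. 14x14x512 -> 4x4x512
        x = self.relu(self.conv(x))           # -> 4x4x128
        x = torch.flatten(x, 1)               # -> 2048
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)                    # class logits

# The first auxiliary head attaches where the feature map is 14x14x512:
aux1 = AuxiliaryClassifier(in_channels=512)
print(aux1(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 1000])
```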
