InceptionV1 (GoogLeNet, 2014)

Learn the fundamentals of the InceptionV1 (also called GoogLeNet) image classification architecture, along with the network-in-network method.

General structure

InceptionV1 is the image classification architecture that won the ILSVRC competition in 2014.

  • It has a 22-layer architecture that applies the network-in-network approach in some layers, which the authors call Inception modules.

  • Its training strategies are similar to those of other architectures: SGD with a momentum of 0.9, a fixed learning rate schedule that decreases the rate by 4% every 8 epochs, dropout at the fully connected layers with a rate of 0.4, the ReLU activation function in the Inception modules, and softmax at the end (see the training sketch after this list).

  • Average pooling is applied between the final convolutional layer and the fully connected layers.

  • Instead of having one fully connected head, the network has three. The two additional extensions are called auxiliary classifiers. The exciting part is that all three heads are used during training: the losses of the auxiliary classifiers are added to the total loss with a weight of 0.3, which strengthens the gradient signal reaching the earlier layers. At inference time, the auxiliary heads are discarded, and only the main classifier is used.
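To make the training recipe concrete, here is a minimal PyTorch sketch of this setup. The framework choice, the initial learning rate of 0.01, and the dummy tensors are assumptions for illustration; the momentum of 0.9, the 4% decay every 8 epochs, and the 0.3 auxiliary-loss weight come from the paper.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import StepLR

# A stand-in parameter set; in practice this would be the full GoogLeNet.
model = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)

# SGD with a momentum of 0.9, as in the paper; the initial learning
# rate of 0.01 is an assumption for this sketch.
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# Fixed schedule: decrease the learning rate by 4% every 8 epochs,
# i.e., multiply it by 0.96. Call scheduler.step() once per epoch.
scheduler = StepLR(optimizer, step_size=8, gamma=0.96)

criterion = nn.CrossEntropyLoss()

# During training, the two auxiliary heads contribute to the total loss
# with a weight of 0.3 each. The logits below are hypothetical outputs
# of the three classifier heads for a batch of 8 images.
main_logits = torch.randn(8, 1000)
aux1_logits = torch.randn(8, 1000)
aux2_logits = torch.randn(8, 1000)
targets = torch.randint(0, 1000, (8,))

loss = criterion(main_logits, targets) + 0.3 * (
    criterion(aux1_logits, targets) + criterion(aux2_logits, targets)
)
```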

Network-in-network

The main logic of network-in-network layers is to apply convolutions of different sizes to the same input and concatenate the resulting feature maps to obtain the final output of a single layer. This approach produces feature maps at different scales from the same input and increases the variety of information extracted from the image, thereby widening the learning capacity of the model.

Following this logic, a network-in-network layer can be built with any combination of convolution filters. The authors call the special layers that use this approach Inception modules. The structure is as follows:

Inception module: naive version
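As a concrete illustration, here is a minimal PyTorch sketch of the naive Inception module: parallel 1×1, 3×3, and 5×5 convolutions plus a 3×3 max pooling, all applied to the same input and concatenated along the channel dimension. The framework choice and the branch channel counts are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

class NaiveInceptionModule(nn.Module):
    """Naive Inception module: convolutions of different sizes applied
    in parallel to the same input, concatenated channel-wise."""

    def __init__(self, in_channels, c1, c3, c5):
        super().__init__()
        # Padding keeps the spatial size identical across branches,
        # which is required for the concatenation.
        self.branch1 = nn.Conv2d(in_channels, c1, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, c3, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, c5, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return torch.cat(
            [
                self.relu(self.branch1(x)),
                self.relu(self.branch3(x)),
                self.relu(self.branch5(x)),
                self.pool(x),  # pooling branch passes channels through unchanged
            ],
            dim=1,  # concatenate the feature maps along the channel axis
        )

# Usage with hypothetical channel counts: output channels are
# 64 + 128 + 32 + 192 (pool branch) = 416.
x = torch.randn(1, 192, 28, 28)
module = NaiveInceptionModule(192, c1=64, c3=128, c5=32)
print(module(x).shape)  # torch.Size([1, 416, 28, 28])
```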

Auxiliary classifiers

Apart from the main classifier head at the end of the model, the authors add two extensions to make predictions from different depths, and therefore different scales, of the network. They call these additional parts auxiliary classifiers. An auxiliary classifier’s structure is as follows:

  • An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the first auxiliary extension and 4×4×528 for the second one.

  • A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation (ReLU).

  • A fully connected layer with 1024 units and rectified linear activation (ReLU), followed by a dropout layer with a rate of 0.7 and a linear layer with softmax as the classifier.
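The following is a minimal PyTorch sketch of this auxiliary head, following the structure listed above. The framework choice and the 1000-class output are assumptions for illustration; the layer sizes come from the bullets above.

```python
import torch
import torch.nn as nn

class AuxiliaryClassifier(nn.Module):
    """Auxiliary classifier head with the structure listed above."""

    def __init__(self, in_channels, num_classes=1000):
        super().__init__()
        # 5x5 average pooling with stride 3: a 14x14 map becomes 4x4.
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)
        # 1x1 convolution with 128 filters for dimension reduction.
        self.conv = nn.Conv2d(in_channels, 128, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        # Fully connected layer with 1024 units, then dropout.
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.dropout = nn.Dropout(p=0.7)
        # Linear classifier producing the class logits for softmax.
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = self.pool(x)                      # e.g. 14x14x512 -> 4x4x512
        x = self.relu(self.conv(x))           # -> 4x4x128
        x = torch.flatten(x, 1)               # -> 2048
        x = self.dropout(self.relu(self.fc1(x)))
        return self.fc2(x)                    # class logits

# The first auxiliary head attaches where the feature map is 14x14x512:
aux1 = AuxiliaryClassifier(in_channels=512)
print(aux1(torch.randn(1, 512, 14, 14)).shape)  # torch.Size([1, 1000])
```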
