...

MobileNetV1 (2017) and MobileNetV2 (2018)

Dive into the essentials of MobileNetV1 and MobileNetV2 architectures, including depthwise separable convolutions, model scaling, and bottleneck features in this lesson.

General structure

The MobileNet model is designed for mobile applications and is TensorFlow's first mobile computer vision model. Because it targets mobile devices, the architecture focuses on producing models with smaller sizes and lower computational cost. Let's take a look at the general properties of the model:

  • It is a 28-layer architecture composed of convolutional layers of various sizes, with fully connected layers at the end.

  • It has training strategies similar to the models we examined before. Each layer is followed by batch normalization and a ReLU activation, except for the last layer, which uses softmax.

  • It uses a width multiplier to scale the number of channels in the input and output feature maps, and a resolution multiplier to scale the size of the input data.

  • Apart from standard convolutions, it uses depthwise separable convolutions to reduce the computational cost and model size.
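The effect of the two multipliers on a layer's cost can be sketched in a few lines. This is a minimal illustration, not the paper's code; the layer sizes (56x56 spatial resolution, 128 channels) are chosen arbitrarily for the example.

```python
# Illustrative sketch of how MobileNet's two multipliers scale a layer's
# cost for one depthwise separable block (multiplications only).

def depthwise_separable_flops(h, w, cin, cout, k=3, alpha=1.0, rho=1.0):
    """Approximate multiply counts for one depthwise separable block.

    alpha: width multiplier      -- scales the channel counts (cin, cout)
    rho:   resolution multiplier -- scales the spatial size (h, w)
    """
    h, w = int(rho * h), int(rho * w)
    cin, cout = int(alpha * cin), int(alpha * cout)
    depthwise = k * k * cin * h * w   # one k x k kernel per input channel
    pointwise = cin * cout * h * w    # 1x1 convolutions mix the channels
    return depthwise + pointwise

full = depthwise_separable_flops(56, 56, 128, 128)
half = depthwise_separable_flops(56, 56, 128, 128, alpha=0.5)
print(full, half)  # alpha = 0.5 cuts the multiplies to roughly a quarter
```

Because both multipliers enter the cost roughly quadratically, even modest values like 0.75 or 0.5 give large savings, which is how MobileNet trades accuracy for speed on constrained devices.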

Depthwise separable convolutions

The main idea behind depthwise separable convolutions is that we can split a convolution filter into its individual kernels (that is, separate it along the depth dimension) and apply each kernel on its own to obtain one feature map per kernel, instead of applying all the kernels together and summing them into a single output.

The following images show a regular convolution operation with 128 filters of size 3x3x3 as a quick reminder. One filter produces a 5x5x1 output feature map, and applying all 128 filters yields a 5x5x128 output feature map.

Standard 2D convolution with 1 filter (left) vs. 128 filters (right)

Now let’s look at what depthwise convolution looks like for a filter of the same size (3x3x3). Each kernel of the filter is applied individually and produces its own single-channel output feature map. As a result, we have a 5x5x3 output feature map this time.
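A naive depthwise convolution can be sketched directly in NumPy. This is an illustrative implementation (stride 1, no padding), not how a framework would implement it; the 7x7x3 input size is chosen so the output matches the 5x5x3 map in the figure.

```python
import numpy as np

# Sketch of a depthwise convolution: each of the 3 kernels convolves only
# its own input channel, with no mixing across channels.

def depthwise_conv(x, kernels):
    h, w, c = x.shape                  # e.g. (7, 7, 3)
    k = kernels.shape[0]               # kernels has shape (k, k, c)
    out = np.zeros((h - k + 1, w - k + 1, c))
    for ch in range(c):                # one kernel per channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, ch] = np.sum(x[i:i+k, j:j+k, ch] * kernels[:, :, ch])
    return out

x = np.random.rand(7, 7, 3)
out = depthwise_conv(x, np.random.rand(3, 3, 3))
print(out.shape)  # (5, 5, 3) -- one single-channel map per input channel
```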

Depthwise convolution

How would we obtain the same 128-channel output at the end? As in previous structures, a 1x1 convolution comes to the rescue again to match the dimensions. We apply 128 filters of size 1x1x3 to the output of the depthwise convolution and obtain a 5x5x128 feature map as the output.
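The pointwise step is just a weighted sum across channels at every spatial position, so it can be sketched as a matrix product over the channel axis. The sizes below mirror the example (5x5x3 input, 128 filters); the function name is illustrative.

```python
import numpy as np

# Sketch of the pointwise (1x1) convolution: each 1x1xC filter takes a
# weighted sum across the C channels at every spatial position.

def pointwise_conv(x, filters):
    # x: (h, w, cin); filters: (cin, cout) -- one column per 1x1 filter
    return x @ filters  # matrix product applied over the channel axis

x = np.random.rand(5, 5, 3)
out = pointwise_conv(x, np.random.rand(3, 128))
print(out.shape)  # (5, 5, 128) -- 128 filters give 128 output channels
```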

Pointwise convolution with 1 filter (left) vs. with 128 filters (right)

To sum up, a depthwise separable convolution consists of two steps:

  1. Depthwise convolution
  2. Pointwise convolution (1x1 convolution)

Depthwise separable convolution pipeline: Depthwise convolution followed by pointwise convolution

FLOPs of a depthwise separable convolution

As mentioned above, the MobileNet architecture was created primarily to produce small, computationally efficient models that can be embedded in mobile applications. So depthwise separable convolutions should deliver a significant reduction in computation compared to standard convolutions.

The usual convolution’s FLOPs in the example above would be: 128 x 3 x 3 x 3 x 5 x 5 = 86,400 ...
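The same arithmetic can be checked directly. The sketch below reproduces the 86,400 figure for the standard convolution and, under the same counting convention (multiplications only), computes the cost of the depthwise separable version of the example.

```python
# FLOPs comparison for the example above (multiplications only,
# matching the 86,400 figure for the standard convolution).

k, cin, cout, h, w = 3, 3, 128, 5, 5

standard = cout * k * k * cin * h * w   # 128 x 3 x 3 x 3 x 5 x 5
depthwise = k * k * cin * h * w         # one 3x3 kernel per input channel
pointwise = cout * cin * h * w          # 128 filters of size 1x1x3
separable = depthwise + pointwise

print(standard, separable)              # 86400 10275
print(round(standard / separable, 1))   # ~8.4x fewer multiplications
```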