...

MobileNetV1 (2017) and MobileNetV2 (2018)

Dive into the essentials of MobileNetV1 and MobileNetV2 architectures, including depthwise separable convolutions, model scaling, and bottleneck features in this lesson.

General structure

The MobileNet model is designed for mobile applications and is TensorFlow's first mobile computer vision model. Because it targets mobile devices, the architecture focuses on producing models with smaller sizes and lower computational cost. Let's take a look at the general properties of the model:

  • It is a 28-layer architecture composed of convolutional layers of various sizes, with fully connected layers at the end.

  • It has training strategies similar to the models we examined before. Each layer is followed by batch normalization and a ReLU activation, except for the last layer, which uses softmax.

  • It uses a width multiplier to scale the number of channels in the input and output feature maps, and a resolution multiplier to scale the size of the input data.

  • Apart from standard convolutions, it uses depthwise separable convolutions to reduce the computational cost and model size.
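The effect of the two multipliers on a layer's cost can be sketched in a few lines. This is a minimal illustration, not the paper's code; the layer sizes (56x56 spatial resolution, 128 channels) are chosen arbitrarily for the example.

```python
# Illustrative sketch of how MobileNet's two multipliers scale a layer's
# cost for one depthwise separable block (multiplications only).

def depthwise_separable_flops(h, w, cin, cout, k=3, alpha=1.0, rho=1.0):
    """Approximate multiply counts for one depthwise separable block.

    alpha: width multiplier      -- scales the channel counts (cin, cout)
    rho:   resolution multiplier -- scales the spatial size (h, w)
    """
    h, w = int(rho * h), int(rho * w)
    cin, cout = int(alpha * cin), int(alpha * cout)
    depthwise = k * k * cin * h * w   # one k x k kernel per input channel
    pointwise = cin * cout * h * w    # 1x1 convolutions mix the channels
    return depthwise + pointwise

full = depthwise_separable_flops(56, 56, 128, 128)
half = depthwise_separable_flops(56, 56, 128, 128, alpha=0.5)
print(full, half)  # alpha = 0.5 cuts the multiplies to roughly a quarter
```

Because both multipliers enter the cost roughly quadratically, even modest values like 0.75 or 0.5 give large savings, which is how MobileNet trades accuracy for speed on constrained devices.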

Depthwise separable convolutions

The main idea behind depthwise separable convolutions is that we can split a convolution filter into its individual kernels (that is, separate it along the depth dimension) and apply each kernel on its own to obtain one feature map per kernel, instead of applying all the kernels together and summing them into a single output.

The following images show a regular convolution operation with 128 filters of size 3x3x3 as a quick reminder. One filter produces a 5x5x1 output feature map, and applying all 128 filters yields a 5x5x128 output feature map.

Standard 2D convolution with 1 filter (left) vs. 128 filters (right)

Now let’s look at what depthwise convolution looks like for a filter of the same size (3x3x3). Each kernel of the filter is applied individually and produces its own single-channel output feature map. As a result, we have a 5x5x3 output feature map this time.
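A naive depthwise convolution can be sketched directly in NumPy. This is an illustrative implementation (stride 1, no padding), not how a framework would implement it; the 7x7x3 input size is chosen so the output matches the 5x5x3 map in the figure.

```python
import numpy as np

# Sketch of a depthwise convolution: each of the 3 kernels convolves only
# its own input channel, with no mixing across channels.

def depthwise_conv(x, kernels):
    h, w, c = x.shape                  # e.g. (7, 7, 3)
    k = kernels.shape[0]               # kernels has shape (k, k, c)
    out = np.zeros((h - k + 1, w - k + 1, c))
    for ch in range(c):                # one kernel per channel
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j, ch] = np.sum(x[i:i+k, j:j+k, ch] * kernels[:, :, ch])
    return out

x = np.random.rand(7, 7, 3)
out = depthwise_conv(x, np.random.rand(3, 3, 3))
print(out.shape)  # (5, 5, 3) -- one single-channel map per input channel
```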

Depthwise convolution

How would we obtain the same 128-channel output at the end? As in previous structures, a 1x1 convolution comes to the rescue again to match the dimensions. We apply 128 filters of size 1x1x3 to the output of the depthwise convolution and obtain a 5x5x128 feature map as the output.
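The pointwise step is just a weighted sum across channels at every spatial position, so it can be sketched as a matrix product over the channel axis. The sizes below mirror the example (5x5x3 input, 128 filters); the function name is illustrative.

```python
import numpy as np

# Sketch of the pointwise (1x1) convolution: each 1x1xC filter takes a
# weighted sum across the C channels at every spatial position.

def pointwise_conv(x, filters):
    # x: (h, w, cin); filters: (cin, cout) -- one column per 1x1 filter
    return x @ filters  # matrix product applied over the channel axis

x = np.random.rand(5, 5, 3)
out = pointwise_conv(x, np.random.rand(3, 128))
print(out.shape)  # (5, 5, 128) -- 128 filters give 128 output channels
```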

Pointwise convolution with 1 filter (left) vs. with 128 filters (right)

To sum up, a depthwise separable convolution consists of two steps:

  1. Depthwise convolution
  2. Pointwise convolution (1x1 convolution)

Depthwise separable convolution pipeline: Depthwise convolution followed by pointwise convolution

FLOPs of a depthwise separable convolution

As mentioned above, the MobileNet architecture was created primarily to produce small, computationally efficient models that can be embedded in mobile applications. So depthwise separable convolutions should deliver a significant reduction in computation compared to standard convolutions.

The usual convolution’s FLOPs in the example above would be: 128 x 3 x 3 x 3 x 5 x 5 = 86,400 ...
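The same arithmetic can be checked directly. The sketch below reproduces the 86,400 figure for the standard convolution and, under the same counting convention (multiplications only), computes the cost of the depthwise separable version of the example.

```python
# FLOPs comparison for the example above (multiplications only,
# matching the 86,400 figure for the standard convolution).

k, cin, cout, h, w = 3, 3, 128, 5, 5

standard = cout * k * k * cin * h * w   # 128 x 3 x 3 x 3 x 5 x 5
depthwise = k * k * cin * h * w         # one 3x3 kernel per input channel
pointwise = cout * cin * h * w          # 128 filters of size 1x1x3
separable = depthwise + pointwise

print(standard, separable)              # 86400 10275
print(round(standard / separable, 1))   # ~8.4x fewer multiplications
```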