Architecture

SSD uses VGG16 as the backbone to extract features from the input image. While it sends these features directly to the classifier, it also continues to apply convolutional layers at different scales to these feature maps, so that detection predictions are obtained for different sizes and scales. In essence, it has six additional convolutional layers following the feature extractor, and the feature map from each scale is sent to the classifier to obtain detection results.

We use four anchor boxes per location on the first feature map and on the last two feature maps, and six anchor boxes per location on the other three feature maps.
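As a rough illustration of where four or six boxes per location can come from, the sketch below builds box shapes from a scale and a list of aspect ratios. The function name `anchor_shapes`, the scale values `0.2` and `0.34`, and the aspect-ratio lists are assumptions chosen for illustration; they are not taken from this lesson.

```python
import math

def anchor_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs, relative to the image size,
    for the anchor boxes at one feature-map location.

    Hypothetical helper: two square boxes are always created, and each
    extra aspect ratio contributes a wide box and a tall box.
    """
    shapes = [(scale, scale)]                 # square box at this level's scale
    extra = math.sqrt(scale * next_scale)     # slightly larger square box
    shapes.append((extra, extra))
    for ratio in aspect_ratios:
        root = math.sqrt(ratio)
        shapes.append((scale * root, scale / root))  # wide box
        shapes.append((scale / root, scale * root))  # tall box
    return shapes

# Illustrative scales only; one extra ratio gives 4 boxes, two give 6.
four_boxes = anchor_shapes(0.2, 0.34, [2])
six_boxes = anchor_shapes(0.2, 0.34, [2, 3])
print(len(four_boxes), len(six_boxes))
```

Under these assumptions, a level configured with one extra aspect ratio yields four boxes per location and a level with two extra ratios yields six, which matches the split described above.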

Finally, we obtain an object detection architecture having predictions from six different scales, producing 8732 detections per class.

As in two-stage object detectors, non-maximum suppression is applied to eliminate duplicate predictions of the same object. Since we have 8732 predictions for a single class alone, it’s practically inevitable that, without non-maximum suppression, many overlapping boxes would crowd around each object.
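To make the idea concrete, here is a minimal sketch of greedy non-maximum suppression in plain Python. The helper names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the `0.5` overlap threshold are assumptions made for illustration, not the exact settings of any particular SSD implementation.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap
    an already kept box by more than iou_threshold, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate boxes on one object plus one box elsewhere.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive
```

In this toy example the second box overlaps the first heavily and is suppressed, while the distant third box is kept. In practice, the procedure is run separately for each class.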


Detecting at different scales is a sound approach and probably strengthens the architecture. On the other hand, producing that many detections may slow the model down. But why is it exactly 8732 detections?

We know that every point in a feature map is the center of a set of anchor boxes: $n$ anchor boxes of different sizes and aspect ratios are created around this center point. Let’s illustrate this mathematically:

The first feature map coming from VGG16 is $38 \times 38$ in width and height, and for this feature map, $4$ anchor boxes are created for each center point.
The second feature map is $19 \times 19$, and $6$ anchor boxes are created for each center point.
The third feature map is $10 \times 10$, and $6$ anchor boxes are created for each center point.
The fourth feature map is $5 \times 5$ ...
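The per-level arithmetic can be tallied with a short script. This is a sketch under stated assumptions: the first four feature-map sizes and the four-or-six anchor split come from this lesson, while the last two sizes ($3 \times 3$ and $1 \times 1$) are assumed from the standard SSD300 configuration.

```python
# (feature-map side length, anchor boxes per center point)
feature_maps = [
    (38, 4),  # first feature map, from VGG16
    (19, 6),
    (10, 6),
    (5, 6),
    (3, 4),   # assumed size (standard SSD300 configuration)
    (1, 4),   # assumed size (standard SSD300 configuration)
]

# Each of the side * side center points carries n anchor boxes.
per_level = [side * side * n for side, n in feature_maps]
total = sum(per_level)

for (side, n), count in zip(feature_maps, per_level):
    print(f"{side} x {side} x {n} = {count}")
print("total:", total)
```

With these sizes, the first level alone contributes $38 \times 38 \times 4 = 5776$ boxes, well over half of the total of 8732.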
