SSD (2015)
Learn the fundamentals of SSD one-stage object detectors.
Single Shot MultiBox Detector (SSD) is one of the first one-stage object detectors. As we know, a one-stage architecture predicts both the localization (bounding box coordinates) and the classification (probabilities for each class) in a single step.
Let’s have a deeper look at the structure!
Architecture
SSD uses VGG16 as the backbone to extract features from the input image. Besides sending these features directly to the classifier, it keeps applying convolutional layers to the feature maps at progressively smaller scales, so that detection predictions are obtained for objects of different sizes. In essence, additional convolutional layers follow the feature extractor, giving six feature maps in total, and the feature map from each scale is sent to the classifier to obtain detection results.
We use four anchor boxes per location at the first and the last two feature map levels, and six anchor boxes per location at the other three levels.
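To make the difference between four and six anchor boxes concrete, here is a minimal sketch of how the box shapes for one feature map cell could be generated. It assumes the convention of the original SSD paper (a set of aspect ratios at one scale, plus one extra square box at an intermediate scale). The function name `default_box_shapes` and the scale values `0.2` and `0.34` are hypothetical and chosen only for illustration.

```python
import math

def default_box_shapes(scale, next_scale, aspect_ratios):
    """Return (width, height) pairs, relative to the image size, for one cell.

    Assumed convention (SSD paper): w = s * sqrt(ar), h = s / sqrt(ar),
    plus one extra square box with scale sqrt(s * s_next).
    """
    shapes = []
    for ar in aspect_ratios:
        shapes.append((scale * math.sqrt(ar), scale / math.sqrt(ar)))
    # Extra square box at a scale between this level and the next one
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes

# Hypothetical scales; three aspect ratios give 4 boxes, five give 6 boxes
four = default_box_shapes(0.2, 0.34, [1.0, 2.0, 0.5])
six = default_box_shapes(0.2, 0.34, [1.0, 2.0, 0.5, 3.0, 1.0 / 3.0])

print(len(four), len(six))
```

Under this assumption, the levels with four boxes simply drop the two most elongated aspect ratios (3 and 1/3).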
Finally, we obtain an object detection architecture having predictions from six different scales, producing 8732 detections per class.
Similarly to the two-stage object detectors, non-maximum suppression is applied to eliminate duplicate predictions of the same object. With 8732 predictions for a single class, it is practically impossible to avoid boxes crowding around an object without non-maximum suppression.
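The idea behind non-maximum suppression can be sketched in a few lines of plain Python. This is a simplified greedy version for a single class, not the exact implementation used by SSD. The helper names `iou` and `nms`, the `[x1, y1, x2, y2]` box format, and the `0.5` threshold are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the highest-scoring box, drop boxes overlapping it."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

# Two boxes on the same object and one box on a different object
boxes = [[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)  # indices of the boxes that survive
```

In this toy example, the second box overlaps heavily with the higher-scoring first box and is suppressed, while the third box survives because it covers a different region.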
Detecting at different scales is a great approach and probably strengthens the architecture. On the other hand, producing that many detections may slow the model down. But again, why is it exactly 8732 detections?
We know that every single point in a feature map is the center of a set of anchor boxes, and anchor boxes of different sizes and aspect ratios are created around this center point. Let’s mathematically illustrate this:
- The first feature map coming from VGG16 is $38 \times 38$ in width and height, and for this feature map, $4$ anchor boxes are created for each center point.
- The second feature map is $19 \times 19$, and $6$ anchor boxes are created for each center point.
- The third feature map is $10 \times 10$, and ...
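The total can be checked with a short calculation. The sketch below assumes the six feature map sizes of the 300 × 300 input configuration from the original SSD paper (38, 19, 10, 5, 3, and 1) together with the four-or-six anchor box pattern described earlier. The variable names are illustrative.

```python
# Assumed SSD300 feature map sizes (width == height) for the six levels
feature_map_sizes = [38, 19, 10, 5, 3, 1]
# Four anchor boxes at the first and last two levels, six at the other three
anchors_per_cell = [4, 6, 6, 6, 4, 4]

# Each cell of a size x size feature map is the center of k anchor boxes
per_level = [size * size * k for size, k in zip(feature_map_sizes, anchors_per_cell)]
total = sum(per_level)

for size, k, count in zip(feature_map_sizes, anchors_per_cell, per_level):
    print(f"{size}x{size} feature map, {k} anchors per cell: {count} boxes")
print("total:", total)  # total: 8732
```

Note that the first level alone contributes 5776 of the boxes, so most of the detections come from the highest-resolution feature map.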