Faster R-CNN (2016)

Learn about Faster R-CNN, its region proposal network (RPN) and anchors, and how the offset loss is calculated.

Faster R-CNN is an improved version of the Fast R-CNN architecture, published only one year later. Even though Fast R-CNN is much faster than R-CNN, it is still not a real-time object detector. Fast R-CNN also replaced the SVM classifiers of R-CNN with a fully connected softmax layer. But selective search still remains an external, independent module, so we can't call Fast R-CNN an end-to-end, unified, trainable network.

Faster R-CNN addresses these two issues and was published as the newest and best member of the region-based convolutional neural network family.

Fast R-CNN architecture

Improvements

The most significant improvement is in the region proposal stage: this architecture uses a region proposal network instead of the selective search algorithm.

Region proposal network (RPN)

A region proposal network (RPN) is a fully convolutional neural network connected to the backbone (the feature extraction layers also used in R-CNN and Fast R-CNN). It takes the feature maps from the backbone as input and, for each anchor box, produces six values:

  • An objectness score as two classes (class 1: object, class 2: not an object)

  • Four coordinates of the proposed region

The objectness score estimates the probability that an object lies inside the box. So, besides the region coordinates, and unlike previous R-CNN versions, we also get an idea about the proposed region before sending it to the classifier. Although this branch can be treated as a logistic regression problem with a single sigmoid output, the architecture prefers a softmax that produces two probability outputs: the probability of having an object inside the box and the probability of not having one.
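The equivalence between the two formulations can be checked numerically: a two-class softmax reduces to a sigmoid applied to the difference of the two logits. A minimal sketch with NumPy (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores (logits) for one box: [object, not-object]
logits = np.array([2.0, 0.5])
p = softmax(logits)

# The two probabilities sum to 1, and the "object" probability equals
# a sigmoid over the logit difference:
assert np.isclose(p.sum(), 1.0)
assert np.isclose(p[0], sigmoid(logits[0] - logits[1]))
```

In other words, the softmax formulation carries no extra modeling power for two classes; it simply keeps the objectness branch in the same form as other classification heads.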

The logic behind the RPN is quite interesting. We can summarize it in three steps:

  1. It takes the feature maps from the backbone, with a size of width × height × 512, and applies a convolution, also called an n × n sliding window.

  2. Every pixel in the output feature map from this convolution is the center of k anchor boxes. In other words, we extract k fixed-size boxes from each center point in this output feature map.

  3. We send these boxes to a 1 × 1 convolutional layer to obtain the final regression output: the four coordinates of every box produced in the previous step. This convolutional layer has 4 × k channels, where k is the number of anchors, each with four coordinates. Similarly, we send the same anchor boxes to the classifier convolutional layer to obtain the objectness score, so the classifier layer has 2 × k channels.
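The three steps above can be sketched as a small PyTorch module. This is an illustrative head only (the class and layer names are our own, not from the paper's code); it assumes a 512-channel backbone, the paper's n = 3 sliding window, and k = 9 anchors:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of an RPN head: k anchors per location ->
    2*k objectness channels and 4*k box-coordinate channels."""

    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # Step 1: n x n sliding window (n = 3 in the original Faster R-CNN).
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Step 3: parallel 1 x 1 convolutions for the two branches.
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # objectness scores
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box coordinates

    def forward(self, feats):
        x = torch.relu(self.conv(feats))
        return self.cls(x), self.reg(x)

feats = torch.randn(1, 512, 25, 25)   # toy backbone feature map
scores, deltas = RPNHead()(feats)
# scores has 2*9 = 18 channels, deltas has 4*9 = 36 channels,
# both at the same spatial resolution as the input feature map.
```

Note that both branches share the sliding-window convolution, so every spatial location produces predictions for all k anchors centered there (step 2) in a single forward pass.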

Anchors

Let’s take a closer look at the term anchors. Anchors were introduced in the Faster R-CNN architecture and later became a widely used method, especially in object detection models. Anchor boxes are not the proposed regions themselves but fixed-size reference boxes used to produce the region proposals.

“Fixed but different sizes and scales” means we choose k boxes of different scales and aspect ratios once, and the sizes of these reference boxes don’t change during training. The original architecture uses 9 anchor boxes, from 3 scales × 3 aspect ratios, as visualized below.

9 anchor boxes for one center point and 3 different sizes x 3 different scales
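Generating the 9 anchors for one center point is straightforward. A minimal sketch in NumPy, assuming the commonly used scales of 128, 256, and 512 pixels and aspect ratios of 1:2, 1:1, and 2:1 (these particular values are an assumption here, though they match the original paper):

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchor boxes centered
    at (cx, cy), each as (x1, y1, x2, y2)."""
    boxes = []
    for s in scales:           # each scale fixes the box area to s * s
        for r in ratios:       # r = height / width
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)     # shape (9, 4)

anchors = make_anchors(400, 400)
```

Note the design choice: within one scale, all three boxes keep the same area (w × h = s²) and only the aspect ratio changes, so the anchor set covers both size and shape variation with just 9 references per location.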

Advantages of anchors

...