Faster R-CNN (2016)

Learn about Faster R-CNN, its region proposal network (RPN) and anchors, and how the offset loss is calculated.

Faster R-CNN is an improved version of the Fast R-CNN architecture, published only one year later. Even though Fast R-CNN is much faster than R-CNN, it is still not a real-time object detector. Fast R-CNN also replaced the SVM classifiers of R-CNN with a fully connected softmax layer. But selective search still remains an external, independent module, so we can't call Fast R-CNN an end-to-end, unified, trainable network.

Faster R-CNN addresses these two issues and was published as the newest and best member of the region-based convolutional neural network family.

Fast R-CNN architecture

Improvements

The most significant improvement is in the region proposal stage: this architecture uses a region proposal network instead of the selective search algorithm.

Region proposal network (RPN)

A region proposal network (RPN) is a fully convolutional neural network connected to the backbone (the feature extraction layers also used in R-CNN and Fast R-CNN). It takes the feature maps from the backbone as input and, for each anchor box, produces six values:

  • An objectness score as two classes (class 1: object, class 2: not an object)

  • Four coordinates of the proposed region

The objectness score estimates the probability that an object lies inside the box. So, besides the region coordinates, and unlike previous R-CNN versions, we also get an idea about the proposed region before sending it to the classifier. Although this branch can be treated as a logistic regression problem with a single sigmoid output, the architecture prefers a softmax that produces two probability outputs: the probability of having an object inside the box and the probability of not having one.
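The equivalence between the two formulations can be checked numerically: a two-class softmax reduces to a sigmoid applied to the difference of the two logits. A minimal sketch with NumPy (the logit values below are made up for illustration):

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical raw scores (logits) for one box: [object, not-object]
logits = np.array([2.0, 0.5])
p = softmax(logits)

# The two probabilities sum to 1, and the "object" probability equals
# a sigmoid over the logit difference:
assert np.isclose(p.sum(), 1.0)
assert np.isclose(p[0], sigmoid(logits[0] - logits[1]))
```

In other words, the softmax formulation carries no extra modeling power for two classes; it simply keeps the objectness branch in the same form as other classification heads.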

The logic behind the RPN is quite interesting. We can summarize it in three steps:

  1. It takes the feature maps from the backbone, with a size of width × height × 512, and applies a convolution, also called an n × n sliding window.

  2. Every pixel in the output feature map from this convolution is the center of k anchor boxes. In other words, we extract k fixed-size boxes from each center point in this output feature map.

  3. We send these boxes to a 1 × 1 convolutional layer to obtain the final regression output: the four coordinates of every box produced in the previous step. This convolutional layer has 4 × k channels, where k is the number of anchors, each with four coordinates. Similarly, we send the same anchor boxes to the classifier convolutional layer to obtain the objectness score, so the classifier layer has 2 × k channels.
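The three steps above can be sketched as a small PyTorch module. This is an illustrative head only (the class and layer names are our own, not from the paper's code); it assumes a 512-channel backbone, the paper's n = 3 sliding window, and k = 9 anchors:

```python
import torch
import torch.nn as nn

class RPNHead(nn.Module):
    """Sketch of an RPN head: k anchors per location ->
    2*k objectness channels and 4*k box-coordinate channels."""

    def __init__(self, in_channels=512, k=9):
        super().__init__()
        # Step 1: n x n sliding window (n = 3 in the original Faster R-CNN).
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3, padding=1)
        # Step 3: parallel 1 x 1 convolutions for the two branches.
        self.cls = nn.Conv2d(512, 2 * k, kernel_size=1)  # objectness scores
        self.reg = nn.Conv2d(512, 4 * k, kernel_size=1)  # box coordinates

    def forward(self, feats):
        x = torch.relu(self.conv(feats))
        return self.cls(x), self.reg(x)

feats = torch.randn(1, 512, 25, 25)   # toy backbone feature map
scores, deltas = RPNHead()(feats)
# scores has 2*9 = 18 channels, deltas has 4*9 = 36 channels,
# both at the same spatial resolution as the input feature map.
```

Note that both branches share the sliding-window convolution, so every spatial location produces predictions for all k anchors centered there (step 2) in a single forward pass.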

Anchors

Let’s take a closer look at the term anchors. Anchors were introduced in the Faster R-CNN architecture and later became a widely used method, especially in object detection models. Anchor boxes are not the proposed regions themselves but fixed-size reference boxes used to produce the region proposals.

“Fixed but different sizes and scales” means we choose k boxes of different scales and aspect ratios once, and the sizes of these reference boxes don’t change during training. The original architecture uses 9 anchor boxes, from 3 scales × 3 aspect ratios, as visualized below.

9 anchor boxes for one center point and 3 different sizes x 3 different scales
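Generating the 9 anchors for one center point is straightforward. A minimal sketch in NumPy, assuming the commonly used scales of 128, 256, and 512 pixels and aspect ratios of 1:2, 1:1, and 2:1 (these particular values are an assumption here, though they match the original paper):

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return the k = len(scales) * len(ratios) anchor boxes centered
    at (cx, cy), each as (x1, y1, x2, y2)."""
    boxes = []
    for s in scales:           # each scale fixes the box area to s * s
        for r in ratios:       # r = height / width
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(boxes)     # shape (9, 4)

anchors = make_anchors(400, 400)
```

Note the design choice: within one scale, all three boxes keep the same area (w × h = s²) and only the aspect ratio changes, so the anchor set covers both size and shape variation with just 9 references per location.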

Advantages of anchors

...