Fast R-CNN (2015)
Explore the concept of Fast R-CNN, its innovations enhancing the R-CNN architecture, and the use of accuracy metrics for evaluating object detection models.
Fast R-CNN is an improved version of the R-CNN architecture. It addresses R-CNN's main weaknesses: it is comparatively slow, and the architecture as a whole is not end-to-end trainable. Let's see how Fast R-CNN handles these shortcomings.
Improvements
R-CNN doesn't apply the backbone directly to the image; instead, it applies the backbone (AlexNet, used to extract features) to each of the roughly 2,000 regions produced by selective search. Running the feature extractor once per region makes R-CNN slow.
To resolve this problem, Fast R-CNN swaps the order of selective search and the backbone. The feature extraction layers are applied once to the entire input image, and the region proposals from selective search are then projected onto the resulting feature map, so each region's features are cropped from a shared feature map rather than recomputed from scratch.
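To make this concrete, here is a minimal sketch in PyTorch that crops fixed-size features for each proposal from a shared feature map via `torchvision.ops.roi_pool`. The backbone choice, image size, and boxes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torchvision

# Backbone runs once on the whole image (illustrative: a ResNet-18 trunk
# standing in for the paper's AlexNet/VGG-style feature extractor).
backbone = torch.nn.Sequential(
    *list(torchvision.models.resnet18(weights=None).children())[:-2]
)

image = torch.randn(1, 3, 512, 512)  # one input image
feature_map = backbone(image)        # shape (1, 512, 16, 16)

# Region proposals come from selective search on the original image;
# here, two hypothetical boxes in (batch_idx, x1, y1, x2, y2) image coords.
proposals = torch.tensor([[0, 30.0, 40.0, 200.0, 220.0],
                          [0, 100.0, 80.0, 400.0, 300.0]])

# Project each proposal onto the shared feature map and pool it to a
# fixed size; spatial_scale maps image coords to feature-map coords.
roi_features = torchvision.ops.roi_pool(
    feature_map, proposals, output_size=(7, 7), spatial_scale=16 / 512
)
print(roi_features.shape)  # torch.Size([2, 512, 7, 7])
```

The key point is that the expensive backbone forward pass happens once per image instead of once per region.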
Secondly, the classifier branch in the head is replaced with fully connected layers followed by a softmax function. The per-class SVMs are gone, which makes it possible to train the backbone together with the head. Even though selective search is still present and the model is therefore not fully end-to-end trainable, the architecture is more uniform than R-CNN's.
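A minimal sketch of such a two-branch head, with layer sizes and class count as illustrative assumptions (21 classes corresponds to PASCAL VOC's 20 classes plus background):

```python
import torch.nn as nn

class FastRCNNHead(nn.Module):
    """Sketch of the Fast R-CNN head: shared fully connected layers
    followed by two sibling branches."""
    def __init__(self, in_features=512 * 7 * 7, num_classes=21):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, 4096), nn.ReLU(),
            nn.Linear(4096, 4096), nn.ReLU(),
        )
        # Softmax classifier replaces R-CNN's per-class SVMs
        # (num_classes includes the background class).
        self.cls_score = nn.Linear(4096, num_classes)
        # Bounding-box regression branch: 4 offsets per class.
        self.bbox_pred = nn.Linear(4096, num_classes * 4)

    def forward(self, roi_features):
        x = self.fc(roi_features.flatten(start_dim=1))
        return self.cls_score(x), self.bbox_pred(x)
```

Because both branches sit on top of the same shared layers, gradients from both tasks flow back through the head and into the backbone.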
Note: Remember that the backbone refers to the main feature extraction part of a model, placed at the beginning of the network; the head is the final part of the model, handling the prediction tasks.
Training strategies
We use approaches similar to the previous models (see the sketch after this list):

- SGD with momentum = 0.9
- Initialize the weights in the head branches from zero-mean Gaussian distributions (standard deviation = 0.01 for the classification branch and 0.001 for the regression branch)
- Transfer learning in the backbone
- Apply L2 regularization
- Apply NMS after the classifier
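The following sketch wires these pieces together, reusing the hypothetical `FastRCNNHead` from the earlier sketch; the learning rate, weight decay value, and the boxes passed to NMS are illustrative assumptions:

```python
import torch
import torchvision

head = FastRCNNHead()  # from the sketch above

# Initialize the two head branches from zero-mean Gaussians
# (std 0.01 for classification, 0.001 for regression); zero the biases.
torch.nn.init.normal_(head.cls_score.weight, mean=0.0, std=0.01)
torch.nn.init.normal_(head.bbox_pred.weight, mean=0.0, std=0.001)
torch.nn.init.zeros_(head.cls_score.bias)
torch.nn.init.zeros_(head.bbox_pred.bias)

# SGD with momentum; weight_decay applies the L2 penalty.
optimizer = torch.optim.SGD(
    head.parameters(), lr=1e-3, momentum=0.9, weight_decay=5e-4
)

# At inference time, NMS prunes overlapping detections after the
# classifier; these boxes and scores are hypothetical placeholders.
boxes = torch.tensor([[30.0, 40.0, 200.0, 220.0],
                      [35.0, 45.0, 205.0, 225.0]])
scores = torch.tensor([0.9, 0.8])
keep = torchvision.ops.nms(boxes, scores, iou_threshold=0.5)
```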
Multitask loss
In image classification models, the head has only one branch, so a single loss trains the architecture. The R-CNN head has two branches, but only the regression branch contributes to training the network. Since Fast R-CNN aims to be closer to end-to-end trainable, the goal is to train the whole model together, using both branches of the head. How can we obtain a single loss when the head has two branches, each producing its own loss?
The model combines the two loss functions (regression loss and classification loss) into a single common loss function, an approach called multitask loss.
We have two branches with separate loss functions, each suited to its own task. We send an image through the network to generate results from both branches, then sum the two losses and use the combined loss to update the weights in both the head and the backbone.
$L_{cls}$ is the loss for classification, and $L_{loc}$ is the loss for regression. Now we find the multitask loss for Fast R-CNN by the formula given below:

$$L(p, u, t^u, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^u, v)$$

- $p$: Predicted class probability distribution
- $u$: Ground truth class
- $t^u$: Predicted bounding box regression offsets for class $u$
- $v$: Ground truth bounding box regression target

The indicator $[u \geq 1]$ equals 1 for non-background RoIs and 0 otherwise, so background regions contribute no regression loss, and $\lambda$ balances the two terms.
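As a sketch of how this combination might look in code, assuming (as in the paper) cross-entropy for $L_{cls}$ and smooth L1 for $L_{loc}$, and that the regression outputs have already been indexed at the ground-truth class $u$:

```python
import torch
import torch.nn.functional as F

def multitask_loss(cls_scores, bbox_preds, labels, bbox_targets, lam=1.0):
    """Sketch of a multitask loss: cross-entropy for the classifier plus
    smooth L1 on the regression targets, with the regression term applied
    only to non-background (u >= 1) RoIs."""
    loss_cls = F.cross_entropy(cls_scores, labels)

    # [u >= 1] indicator: background RoIs contribute no box loss.
    fg = labels >= 1
    if fg.any():
        loss_loc = F.smooth_l1_loss(bbox_preds[fg], bbox_targets[fg])
    else:
        loss_loc = cls_scores.new_zeros(())

    return loss_cls + lam * loss_loc

# Hypothetical batch of 4 RoIs with 21 classes (label 0 = background):
scores = torch.randn(4, 21)
preds = torch.randn(4, 4)  # offsets already gathered for class u
labels = torch.tensor([0, 3, 5, 0])
targets = torch.randn(4, 4)
print(multitask_loss(scores, preds, labels, targets))
```

Summing the two terms into one scalar is what lets a single backward pass update the classifier branch, the regression branch, and the backbone together.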