R-CNN (2014)
Learn about R-CNN, its architecture, and where it is used.
R-CNN (region-based convolutional neural network) is the first member of the R-CNN family and performs object detection on images in two separate stages. As mentioned briefly, the first stage proposes candidate regions, each described by a bounding box with four coordinates. The second stage classifies these regions and refines their boxes into the final predictions. Let’s examine exactly how the R-CNN architecture works!
Note: The term two-stage might imply that the model performs only two steps. When we say that an object detection model is two-stage, we mean that the object proposal step and the classification step are separate, so this part happens in two steps. Before proceeding to the object proposal and classification steps, we still need a backbone to extract features, plus some additional steps depending on the architecture.
ROI extraction
First, we apply the selective search algorithm to extract region proposals. This segmentation method distinguishes foreground pixels from background ones.
Note: You might encounter the terms foreground pixel, foreground region, background pixel, or background region in the literature. We use foreground to refer to the significant parts of the image that carry features that can belong to an object. Everything else is background.
Selective search
Selective search, being a classical segmentation approach, applies the following steps:
- It calculates the similarities between all neighboring regions.
- It groups the two most similar regions together and then calculates the new similarities between the resulting region and its neighbors.
- It repeats this process until the whole object is covered by a single region. In other words, the regions stop growing when there are no similar neighboring regions left to merge.
Selective search recursively combines these groups of regions into larger ones to determine what could be an object. While doing so, it produces about 2,000 candidate regions to investigate, also called regions of interest (ROIs). Note that the number of ROIs does not depend on the dataset; it is the fixed amount that selective search extracts.
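The merging loop described above can be sketched in a few lines of Python. This is a simplified illustration, not the full selective search algorithm: the similarity here is a toy measure based only on mean intensity, and the names `greedy_merge`, `similarity`, `merge_boxes`, and `min_similarity` are hypothetical helpers introduced for this example.

```python
def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) that encloses both input boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def similarity(r1, r2):
    """Toy similarity (assumption): closer mean intensity means more similar."""
    return -abs(r1["mean"] - r2["mean"])


def greedy_merge(regions, neighbors, min_similarity=-0.2):
    """Greedily merge neighboring regions and return every box seen.

    regions:   dict id -> {"mean": float, "size": int, "box": (x1, y1, x2, y2)}
    neighbors: set of frozenset({id_a, id_b}) pairs of touching regions
    """
    regions = dict(regions)      # work on copies so the inputs stay untouched
    neighbors = set(neighbors)
    proposals = [r["box"] for r in regions.values()]
    next_id = max(regions) + 1

    while neighbors:
        # Find the most similar pair among all neighboring regions.
        pair = max(neighbors, key=lambda p: similarity(*(regions[i] for i in p)))
        a, b = tuple(pair)
        if similarity(regions[a], regions[b]) < min_similarity:
            break  # no sufficiently similar neighbors remain, so stop growing

        # Merge the pair into one region with a size-weighted mean intensity.
        ra, rb = regions.pop(a), regions.pop(b)
        size = ra["size"] + rb["size"]
        merged = {
            "mean": (ra["mean"] * ra["size"] + rb["mean"] * rb["size"]) / size,
            "size": size,
            "box": merge_boxes(ra["box"], rb["box"]),
        }
        regions[next_id] = merged
        proposals.append(merged["box"])

        # Neighbors of a or b become neighbors of the merged region.
        rewired = set()
        for p in neighbors:
            if p == pair:
                continue
            if a in p or b in p:
                (other,) = p - {a, b}
                rewired.add(frozenset({other, next_id}))
            else:
                rewired.add(p)
        neighbors = rewired
        next_id += 1

    return proposals
```

Every box produced along the way is kept as a proposal, which is why the method yields candidates at several scales rather than a single final segmentation.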
Feature extraction using CNNs
After extracting almost 2,000 candidate ROIs, each of which might contain an object according to the segmentation, we apply a CNN to these boxes one by one to extract features and send them to the head in the next step. This is unconventional! Imagine we have one image; instead of sending it directly, we send 2,000 cropped images (ROIs) to our convolutional neural network.
Note: We keep every ROI’s four coordinates in a separate array so as not to lose the connection between the feature maps of an ROI and where they come from in the original image.
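As a rough illustration of this bookkeeping, the sketch below crops each ROI from a tiny image, passes the crop through a backbone, and stores the coordinates at the same index as the features. The image is a plain list of rows, and `placeholder_cnn` is a hypothetical stand-in that only computes two simple statistics; a real backbone would return learned feature maps.

```python
def crop(image, box):
    """Crop a 2D image (list of rows) to box = (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def placeholder_cnn(patch):
    """Hypothetical stand-in for the CNN: returns [mean pixel, max pixel]."""
    pixels = [p for row in patch for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def extract_roi_features(image, boxes, backbone=placeholder_cnn):
    """Run the backbone on each ROI crop, one by one."""
    features, coords = [], []
    for box in boxes:
        features.append(backbone(crop(image, box)))
        coords.append(box)  # same index as its features, so the link is kept
    return features, coords


# Example: a 4x4 image with pixel values 0..15 and two ROIs.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
features, coords = extract_roi_features(image, [(0, 0, 2, 2), (2, 2, 4, 4)])
```

Because `features[i]` and `coords[i]` always refer to the same ROI, any later step can map a prediction back to its location in the original image.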
Head with two branches
It’s time to obtain final predictions from the head of the architecture. The head consists of two components: a classification layer and a bounding box regressor.
Classification
We know what classification is; now we need to understand how it is used in this architecture. For a single ROI, we have 256 feature maps, and we send these maps directly to the classifier. The classifier gives us the predicted class; if, and only if, the prediction score reaches our confidence threshold, we keep the ROI in the related class’s array. Every class has its own array; this is nothing more than a programming detail to keep the objects in good order.
Note: Remember what the prediction score is. The model gives us probabilities for each class; we select the class with the maximum probability as the predicted one. The probability for this chosen class is our prediction score. Suppose we set the confidence threshold to 0.9; in that case, only the ROIs with a prediction score >= 0.9 can pass to the bounding box regressor.
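The thresholding logic can be sketched as follows. This is a minimal example under the assumption that the classifier outputs raw scores that are turned into probabilities with a softmax; the lesson does not specify the classifier, and `filter_by_confidence` is a hypothetical helper name.

```python
import math
from collections import defaultdict


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def filter_by_confidence(roi_scores, roi_boxes, class_names, threshold=0.9):
    """Keep each ROI in its predicted class's list if its score >= threshold."""
    kept = defaultdict(list)
    for scores, box in zip(roi_scores, roi_boxes):
        probs = softmax(scores)
        best = max(range(len(probs)), key=probs.__getitem__)  # predicted class
        if probs[best] >= threshold:  # prediction score vs. confidence threshold
            kept[class_names[best]].append((box, probs[best]))
    return dict(kept)


# Example: the first ROI is confidently a cat, the second is ambiguous.
result = filter_by_confidence(
    roi_scores=[[5.0, 0.0, 0.0], [1.0, 1.2, 0.9]],
    roi_boxes=[(0, 0, 10, 10), (5, 5, 20, 20)],
    class_names=["cat", "dog", "background"],
)
```

Only ROIs that survive this filter would move on to the bounding box regressor; the ambiguous ROI in the example is discarded because no class reaches the threshold.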
Bounding box regressor
It is time to postprocess the coordinates of this ROI to obtain our final bounding box. We use a bounding box regressor, which takes the feature maps as input like the classifier but predicts four coordinates instead of a class.
The bounding box regressor is a fully connected neural network trained using ground truth coordinates. It has the same logic as classifier training. ...
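The lesson does not spell out what the four predicted values look like. One common choice, used in the original R-CNN paper, is to predict offsets relative to the ROI: a shift of the box center scaled by the ROI's width and height, and a log-scale change of the width and height. The sketch below applies such offsets to an ROI, assuming that parameterization; `apply_deltas` is a hypothetical helper name, and the regressor that would produce the offsets is not shown.

```python
import math


def apply_deltas(roi, deltas):
    """Refine roi = (x1, y1, x2, y2) with predicted deltas = (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = roi
    dx, dy, dw, dh = deltas

    # Describe the ROI by its center, width, and height.
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h

    # Shift the center proportionally to the ROI size; rescale in log space.
    new_cx = cx + dx * w
    new_cy = cy + dy * h
    new_w = w * math.exp(dw)
    new_h = h * math.exp(dh)

    # Convert back to corner coordinates.
    return (
        new_cx - 0.5 * new_w,
        new_cy - 0.5 * new_h,
        new_cx + 0.5 * new_w,
        new_cy + 0.5 * new_h,
    )


# Example: all-zero deltas leave the ROI unchanged.
refined = apply_deltas((10, 20, 50, 60), (0.0, 0.0, 0.0, 0.0))
```

Predicting relative offsets rather than absolute pixel coordinates keeps the targets on a similar scale for small and large ROIs, which is one reason this parameterization is popular.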