R-CNN (2014)

Learn about R-CNN, its architecture, and where it is used.

We'll cover the following...

ROI extraction
- Selective search
Feature extraction using CNNs
Head with two branches
Flowchart
Weak points of R-CNN
Training strategies
- Ground truths (labels) of an object detection dataset

Note: The term two-stage might suggest that the model performs only two steps. When we say that an object detection model is two-stage, we mean that the object proposal step and the classification step are separate, so detection happens in two steps. Besides these two steps, we still need a backbone to extract features, plus some additional steps depending on the architecture.

ROI extraction

First, we apply the selective search algorithm to extract region proposals. This segmentation method distinguishes foreground pixels from background ones.

Note: You might encounter the terms foreground pixel, foreground region, background pixel, or background region in the literature. We use foreground to refer to the significant parts of the image whose features may belong to an object. Everything else is a background region.

Selective search

Selective search, a classical segmentation-based approach, starts from many small regions and applies the following steps:

It calculates the similarities between all neighboring regions.
It groups the two most similar regions together and then calculates the new similarities between the resulting region and its neighbors.
This process repeats until the whole image is merged into a single region. Every region produced along the way, from small fragments to large groups, is kept as a candidate.

Selective search thus combines regions into progressively larger ones to determine what could be an object. Along the way, it produces about 2,000 candidate areas per image, called regions of interest (ROIs). Note that this number does not depend on the dataset; it is roughly how many proposals are taken from selective search for each image.
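
As a rough illustration of the grouping loop described above, here is a minimal pure-Python sketch. It is not the actual selective search implementation: the region representation (a mean intensity and a pixel count), the intensity-only similarity, and the name `hierarchical_grouping` are all assumptions made for this example. The real algorithm combines several similarity measures, such as color and texture, and works on image segments.

```python
from itertools import count

def hierarchical_grouping(regions, neighbors):
    """Greedy region merging in the spirit of selective search (toy version).

    regions:   dict region_id -> (mean_intensity, pixel_count)
    neighbors: set of frozenset({id_a, id_b}) pairs of touching regions
    Returns every region seen along the way, each as the set of starting
    region ids it covers. Each one is a candidate ROI.
    """
    regions = dict(regions)
    members = {rid: {rid} for rid in regions}
    proposals = [frozenset(m) for m in members.values()]
    next_id = count(max(regions) + 1)

    def similarity(pair):
        a, b = tuple(pair)
        # Assumption: similarity is only the closeness of mean intensity.
        return -abs(regions[a][0] - regions[b][0])

    neighbors = set(neighbors)
    while neighbors:
        best = max(neighbors, key=similarity)  # most similar touching pair
        a, b = tuple(best)
        (mean_a, n_a), (mean_b, n_b) = regions[a], regions[b]
        new = next(next_id)
        regions[new] = ((mean_a * n_a + mean_b * n_b) / (n_a + n_b), n_a + n_b)
        members[new] = members[a] | members[b]
        proposals.append(frozenset(members[new]))

        # Anything that touched a or b now touches the merged region.
        rewired = set()
        for pair in neighbors:
            if pair == best:
                continue
            others = pair - {a, b}
            if len(others) == 2:
                rewired.add(pair)
            else:
                (other,) = others
                rewired.add(frozenset({other, new}))
        neighbors = rewired
        del regions[a], regions[b]
    return proposals

# Four starting regions in a row, described by (mean intensity, pixel count).
start = {0: (10.0, 1), 1: (12.0, 1), 2: (50.0, 1), 3: (90.0, 1)}
touching = {frozenset({0, 1}), frozenset({1, 2}), frozenset({2, 3})}
proposals = hierarchical_grouping(start, touching)
print(len(proposals))  # 7: four starting regions plus three merged ones
```

Keeping every intermediate region, not just the final one, is what lets the method propose objects at several scales.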

Feature extraction using CNNs

After extracting about 2,000 ROIs that might contain an object according to the segmentation, we apply a CNN to each of these boxes, one by one, to extract features and pass them to the head in the next step. It’s unconventional! Imagine we have one image; instead of sending it through the network once, we send 2,000 cropped images (ROIs) to our convolutional neural network.

Note: We keep every ROI’s four coordinates in a separate array so as not to lose the connection between an ROI’s feature maps and their location in the original image.
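
To make the per-ROI processing concrete, the following sketch crops each ROI, resizes it to a fixed size, and runs a feature extractor on it while keeping the coordinates index-aligned with the features. Everything here is illustrative: the image is a plain nested list, `toy_backbone` is a hypothetical stand-in for a real CNN, boxes are assumed to be `(x1, y1, x2, y2)` with exclusive right and bottom edges, and `out_size=4` is a toy value rather than a real network input size.

```python
def crop_and_warp(image, box, out_size):
    """Crop box = (x1, y1, x2, y2) from a 2D image (list of rows) and
    resize it to out_size x out_size with nearest-neighbor sampling."""
    x1, y1, x2, y2 = box
    h, w = y2 - y1, x2 - x1
    return [
        [image[y1 + (r * h) // out_size][x1 + (c * w) // out_size]
         for c in range(out_size)]
        for r in range(out_size)
    ]

def toy_backbone(patch):
    # Hypothetical stand-in for a CNN: the "feature" is the mean pixel value.
    flat = [p for row in patch for p in row]
    return [sum(flat) / len(flat)]

def extract_features(image, boxes, backbone, out_size=4):
    """Run the backbone once per ROI. features[i] belongs to boxes[i],
    so the link back to the original image is never lost."""
    features = []
    for box in boxes:
        patch = crop_and_warp(image, box, out_size)
        features.append(backbone(patch))  # one forward pass per ROI
    return boxes, features

# An 8x8 toy image whose pixel value encodes its position.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
boxes = [(0, 0, 4, 4), (4, 4, 8, 8)]
boxes, features = extract_features(image, boxes, toy_backbone)
print(features)  # [[13.5], [49.5]]
```

The loop runs the backbone once per box, which is why about 2,000 ROIs mean about 2,000 forward passes for a single image.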

Head with two branches

It’s time to obtain the final predictions from the head of the architecture. The head consists of two components: a classification layer and a bounding box regressor.

Classification

We know what classification is; we only need to understand how it is used in this architecture. Each ROI yields 256 feature maps, and we send these maps directly to the classifier. The classifier gives us the predicted class; if, and only if, the prediction score meets our confidence threshold, we keep the ROI in the array of the predicted class. Each class has its own array; this is nothing more than a programming detail for keeping the detections organized.

Note: Remember what the prediction score is. The model gives us a probability for each class; we select the class with the maximum probability as the predicted one. The probability of this chosen class is our prediction score. Suppose we set the confidence threshold to 0.9; in that case, only the ROIs with a prediction score >= 0.9 pass to the bounding box regressor.
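
The thresholding logic can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the classifier is assumed to output raw scores (logits) that are turned into probabilities with a softmax, and the function name `filter_by_confidence` and the class list are hypothetical.

```python
import math
from collections import defaultdict

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_by_confidence(roi_logits, boxes, class_names, threshold=0.9):
    """Keep an ROI only if its best class probability reaches the threshold.
    Returns one list of (box, score) per class that has surviving ROIs."""
    kept = defaultdict(list)
    for logits, box in zip(roi_logits, boxes):
        probs = softmax(logits)
        score = max(probs)                       # prediction score
        label = class_names[probs.index(score)]  # predicted class
        if score >= threshold:
            kept[label].append((box, score))
    return dict(kept)

class_names = ["cat", "dog", "pencil"]
boxes = [(10, 20, 50, 60), (0, 0, 30, 30), (40, 40, 90, 80)]
roi_logits = [[5.0, 0.0, 0.0],   # confident "cat"
              [1.0, 1.0, 1.0],   # undecided, will be dropped
              [0.0, 6.0, 0.0]]   # confident "dog"
detections = filter_by_confidence(roi_logits, boxes, class_names)
print(sorted(detections))  # ['cat', 'dog']
```

The second ROI has a prediction score of about 0.33, so it never reaches the bounding box regressor.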

Bounding box regressor

It is time to postprocess the coordinates of this ROI to obtain our final bounding box. We use a bounding box regressor, which takes the feature maps as input like the classifier but predicts four coordinates instead of a class.

The bounding box regressor is a fully connected neural network trained using ground truth coordinates. It has the same logic as classifier training.

Note: We use a class-specific bounding box regressor. If the ROI is classified as a cat, it goes to the cat’s bounding box regressor; if it’s classified as a dog, it goes to the dog’s bounding box regressor.
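
The sketch below shows one way the class-specific lookup and the box refinement could fit together. It assumes the regressor outputs four offsets `(dx, dy, dw, dh)` relative to the proposal (a shift of the center scaled by the box size and a log-scale change of width and height), which is the parameterization used in the original R-CNN paper. The `regressors` dictionary and its lambda functions are hypothetical stand-ins for trained models.

```python
import math

def apply_deltas(box, deltas):
    """Refine box = (x1, y1, x2, y2) with offsets (dx, dy, dw, dh)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx, cy = x1 + 0.5 * w, y1 + 0.5 * h
    dx, dy, dw, dh = deltas
    new_cx, new_cy = cx + dx * w, cy + dy * h             # shift the center
    new_w, new_h = w * math.exp(dw), h * math.exp(dh)     # rescale the size
    return (new_cx - 0.5 * new_w, new_cy - 0.5 * new_h,
            new_cx + 0.5 * new_w, new_cy + 0.5 * new_h)

def refine(box, label, features, regressors):
    """Send the ROI to the regressor that belongs to its predicted class."""
    deltas = regressors[label](features)
    return apply_deltas(box, deltas)

# Hypothetical per-class regressors; real ones would be trained models.
regressors = {
    "cat": lambda features: (0.1, 0.0, 0.0, 0.0),  # nudge the box to the right
    "dog": lambda features: (0.0, 0.0, 0.0, 0.0),  # leave the box unchanged
}

roi = (10, 20, 50, 60)
print(refine(roi, "dog", None, regressors))  # (10.0, 20.0, 50.0, 60.0)
```

Predicting offsets relative to the proposal, rather than absolute coordinates, keeps the targets small and independent of where the ROI sits in the image.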

Imagine we have a class pencil and ...
