YOLOX architecture

The YOLOX architecture takes one step back: instead of building on YOLOv4, it uses YOLOv3 as its base structure and improves it in a different direction than YOLOv4 did.

YOLOX has three major novelties: a decoupled head, an anchor-free design, and an advanced label assignment strategy.

Decoupled head

An object detection problem consists of two tasks: classification and localization, in other words, a classification problem and a regression problem. In the models we have seen so far, a single head solves both tasks, whether the model is two-stage or one-stage. YOLOX brings a new approach to the YOLO family: it decouples the head and solves each task in its own branch.

The figure above shows the difference between coupled and decoupled heads. The coupled head takes the last feature map coming from the backbone and applies a convolution that produces class scores, objectness scores, and localization results in a single output, with the number of output channels proportional to the number of anchor boxes the model uses. Each output channel of the head’s convolutional layer holds the weights for one of these tasks. In the decoupled head, the last feature map follows two parallel paths: one for the classification task, and one that is split once more to predict the objectness score and the localization results separately.
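To make the difference concrete, here is a minimal sketch that only counts the output channels each head produces. It is not the YOLOX implementation: the function names, the `HeadOutput` type, and the example values (80 classes, 3 anchor boxes for the coupled head, 1 prediction per location for the decoupled head) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadOutput:
    name: str      # which task(s) this output tensor serves
    channels: int  # number of output channels of the final convolution


def coupled_head_outputs(num_classes: int, num_anchors: int) -> list[HeadOutput]:
    # A single convolution predicts everything at once. For every anchor box
    # it emits 4 box values + 1 objectness score + one score per class.
    return [HeadOutput("box+obj+cls", num_anchors * (4 + 1 + num_classes))]


def decoupled_head_outputs(num_classes: int, num_anchors: int = 1) -> list[HeadOutput]:
    # Each task gets its own output. The default num_anchors=1 is an
    # assumption that mirrors an anchor-free setup.
    return [
        HeadOutput("cls", num_anchors * num_classes),  # classification branch
        HeadOutput("reg", num_anchors * 4),            # localization (box) output
        HeadOutput("obj", num_anchors * 1),            # objectness output
    ]


coupled = coupled_head_outputs(num_classes=80, num_anchors=3)
decoupled = decoupled_head_outputs(num_classes=80)

print([o.channels for o in coupled])    # [255]
print([o.channels for o in decoupled])  # [80, 4, 1]
```

With these example values, the coupled head packs all tasks into one 255-channel output, while the decoupled head keeps three small outputs, one per task.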

<b>Note</b>: YOLOX is an anchor-free model, so we will see how it works without anchor boxes. Rather than one set of output channels per anchor box, each branch of the YOLOX decoupled head has a single set of output channels for its task.

Anchor-free

YOLOX doesn’t use anchor boxes, breaking the rule followed by YOLOv2, v3, and v4. YOLO, the very first member of the family, didn’t use anchor boxes either, but YOLOX still follows an architecture similar to YOLOv3; that is, unlike YOLO, it has no fully connected layers at the end. To avoid departing too far from the YOLOv3 structure, we adopt the anchor-free mechanism in a very simple way: pretend that we have one anchor box instead of three for each level of the feature map since, in the end, it’s the same as writing $H \times W \times Class \ amount * anchor \ box$ ...
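As a rough illustration of what going from three anchor boxes to one per feature-map level means, the sketch below counts the predictions a detector makes for one image. The 640×640 input size, the strides of 8, 16, and 32, and the helper name `predictions_per_image` are assumptions for this example, not values taken from the text above.

```python
def predictions_per_image(input_size: int, strides: tuple[int, ...], num_anchors: int) -> int:
    """Count box predictions over all feature-map levels of a square input."""
    total = 0
    for stride in strides:
        side = input_size // stride          # feature map is side x side cells
        total += side * side * num_anchors   # each cell predicts num_anchors boxes
    return total


STRIDES = (8, 16, 32)  # assumed strides of the three feature-map levels

anchor_based = predictions_per_image(640, STRIDES, num_anchors=3)
anchor_free = predictions_per_image(640, STRIDES, num_anchors=1)

print(anchor_based)  # 25200
print(anchor_free)   # 8400
```

Under these assumptions, treating every location as having a single anchor cuts the number of predictions per image to one third.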
