YOLO (You Only Look Once) is a popular family of deep-learning models designed for real-time object detection. Its evolution can be traced through several major versions, each building on the strengths and addressing the shortcomings of its predecessor.
YOLOv1
The original YOLO model was a convolutional neural network (CNN) trained on the PASCAL VOC object detection dataset. It divided the input image into a grid, with each grid cell responsible for predicting bounding boxes and class probabilities for objects whose centers fall inside it. Because a single CNN processed the whole image and produced all predictions in one forward pass, the model was notably fast. The design had its limitations, however. YOLO was less accurate than other detectors of its era, and it struggled in particular with small objects and with objects that were closely spaced.
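The grid idea can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation: the values S = 7, B = 2, and C = 20 come from the PASCAL VOC setup in the original YOLO paper rather than from this article, and the function names are hypothetical.

```python
# Illustrative sketch only. S, B, C follow the PASCAL VOC configuration
# described in the original YOLO paper; they are assumptions here.
S, B, C = 7, 2, 20  # grid size, boxes per cell, number of classes


def output_shape(s=S, b=B, c=C):
    """Shape of the prediction tensor.

    Each cell predicts b boxes (x, y, w, h, confidence) plus one set of
    c class probabilities shared by all boxes in that cell.
    """
    return (s, s, b * 5 + c)


def responsible_cell(x_center, y_center, s=S):
    """Return (row, col) of the grid cell that contains a box center.

    Coordinates are normalized to [0, 1] relative to the image size.
    The min() keeps a center lying exactly on the right or bottom edge
    inside the last cell.
    """
    col = min(int(x_center * s), s - 1)
    row = min(int(y_center * s), s - 1)
    return row, col


if __name__ == "__main__":
    print(output_shape())
    # Two nearby objects whose centers land in the same cell.
    print(responsible_cell(0.50, 0.50), responsible_cell(0.55, 0.52))
```

In this sketch, two objects whose centers fall in the same cell compete for that cell's single set of class probabilities, which is one way to see why closely spaced objects were difficult for the model.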
YOLOv2 (YOLO9000)
YOLOv2 was a significant improvement over YOLOv1 and introduced several new features:
Anchor boxes: Predicting boxes relative to a set of predefined anchor boxes improved bounding box predictions and helped the model handle objects of different shapes and sizes.
New backbone: A new architecture called Darknet-19 was introduced as the backbone, which was faster and more efficient than the one used in v1.
Batch normalization: Batch normalization was added to the convolutional layers, which stabilized training and reduced overfitting.
Combined object detection and classification tasks: Training jointly on detection and classification data allowed the model to detect over 9,000 object categories (the source of the name YOLO9000).
Image resolution agnostic: YOLOv2 removed the fully connected layers used in YOLOv1, making the network fully convolutional. Without these layers, the authors were able to train the network on different image resolutions on the fly.
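Two of the features above, anchor boxes and resolution-agnostic training, can be sketched as follows. This is a hedged illustration under stated assumptions: the decoding formulas and the input sizes (multiples of 32 between 320 and 608) follow the YOLOv2 paper rather than this article, and the function names are hypothetical.

```python
import math
import random


def sigmoid(v):
    """Logistic function, used to keep the predicted center inside its cell."""
    return 1.0 / (1.0 + math.exp(-v))


def decode_box(t, cell, anchor):
    """Turn raw network outputs into a box, in grid-cell units.

    t      -- raw outputs (tx, ty, tw, th)
    cell   -- top-left corner (cx, cy) of the predicting grid cell
    anchor -- prior width and height (pw, ph) of the anchor box
    """
    tx, ty, tw, th = t
    cx, cy = cell
    pw, ph = anchor
    bx = sigmoid(tx) + cx    # center x, offset within the cell
    by = sigmoid(ty) + cy    # center y, offset within the cell
    bw = pw * math.exp(tw)   # width as a scaling of the anchor width
    bh = ph * math.exp(th)   # height as a scaling of the anchor height
    return bx, by, bw, bh


def sample_input_size(rng, stride=32, low=320, high=608):
    """Pick a training resolution that is a multiple of the network stride.

    The default range is an assumption taken from the YOLOv2 paper.
    """
    return rng.choice(range(low, high + 1, stride))


if __name__ == "__main__":
    # Zero outputs leave the anchor unchanged and center it in the cell.
    print(decode_box((0.0, 0.0, 0.0, 0.0), cell=(3, 4), anchor=(1.5, 2.0)))
    print(sample_input_size(random.Random(0)))
```

Because a fully convolutional network has no fixed-size dense layer, changing the input size only changes the size of the output grid, which is what makes this kind of resolution sampling possible during training.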