Object Detection

Learn about the YOLO architecture and the working of VGG16.

We'll cover the following

YOLO architecture

We have so far only discussed object recognition. In many applications, we want to go further and also tell where the objects are in the picture. For example, for self-driving cars, we want to know where pedestrians are or where the road is. One way of doing this is to place bounding boxes around the objects, as shown in the figure below. A popular architecture for this is called YOLO (You Only Look Once). The idea is to train a network not only on single labels but also on the location (x, y) of a bounding box, its size (w, h), and its confidence.

The network does this by dividing an image into an array of grid cells of size S × S, where S is set to S = 7 in the original example. For each grid cell, the network predicts B bounding boxes (B = 2 in the original example), each described by the five numbers mentioned earlier, (x, y, w, h, conf), together with C class probabilities, so that we need S × S × (B · 5 + C) output nodes. Here, C is the number of classes, which was C = 20 in the dataset in the original paper, hence the output shape of 7 × 7 × 30.
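The output-size arithmetic above can be sketched directly; the values below are the ones used in the original YOLO paper (S = 7, B = 2, C = 20):

```python
# YOLOv1 output-size calculation, using the values from the original paper
S = 7   # grid cells per side
B = 2   # bounding boxes predicted per grid cell
C = 20  # number of classes in the dataset

# Each predicted box contributes five numbers: x, y, w, h, conf
outputs_per_cell = B * 5 + C            # 2 * 5 + 20 = 30
total_outputs = S * S * outputs_per_cell

print(outputs_per_cell)  # 30
print(total_outputs)     # 7 * 7 * 30 = 1470
```

The final layer of the network would therefore have 1,470 output nodes, typically reshaped to the 7 × 7 × 30 tensor described above.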
