Understanding Anchor Boxes: Part II
Learn how anchor boxes are calculated.
How do anchor boxes work?
Anchor boxes are predefined bounding boxes of various shapes and sizes that help detect objects with different aspect ratios by adjusting and refining their dimensions during training to match the ground truth boxes closely. Let’s learn how they work in a pipeline.
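To make "adjusting and refining" concrete, here is a minimal sketch of how a YOLOv3-style head turns an anchor plus four predicted offsets into a box. The function name `decode_box` and its argument layout are hypothetical, chosen for illustration; the parameterization (sigmoid on the center offsets, exponential on the size offsets) follows the commonly described YOLOv3 formulation, and real implementations differ in details.

```python
import math

def decode_box(offsets, anchor_wh, cell_xy, stride):
    """Turn raw offsets (tx, ty, tw, th) into a box (center x, center y, w, h) in pixels.

    Hypothetical helper: anchor_wh is the anchor's width/height in pixels,
    cell_xy is the (column, row) index of the grid cell, and stride is the
    cell size in pixels (for example, 32 for a 13 x 13 grid on a 416 x 416 image).
    """
    tx, ty, tw, th = offsets
    pw, ph = anchor_wh
    cx, cy = cell_xy

    def sigmoid(v):
        return 1.0 / (1.0 + math.exp(-v))

    # The center is constrained to stay inside its grid cell.
    bx = (sigmoid(tx) + cx) * stride
    by = (sigmoid(ty) + cy) * stride
    # Width and height are the anchor's dimensions, rescaled by exp(offset).
    bw = pw * math.exp(tw)
    bh = ph * math.exp(th)
    return bx, by, bw, bh
```

With all offsets at zero, the prediction is simply the anchor itself, centered in its cell. This is why anchors that already resemble the data leave the network with only small corrections to learn.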
Calculating the size of an anchor box
Picking anchors that represent our data well is extremely important because YOLO learns to make adjustments to these anchor boxes to predict a bounding box for an object. Here are the steps we need to follow to calculate the anchor box size:
Get bounding boxes’ dimensions from the training data: Since we need to find out the height and width of the anchors, we first determine the height and width of all the bounding boxes in the training data.
Cluster the bounding boxes: We group the bounding boxes by their width and height, using as many clusters as we need anchors. That number follows from YOLO's grid-based approach to object detection. To illustrate, in YOLOv3, a 416 × 416 image is divided into three grids of sizes 13 × 13, 26 × 26, and 52 × 52.
Let’s say we have three anchor boxes for each grid cell. Because YOLO makes predictions at three scales (small, medium, and large), we have a total of nine anchor boxes (three boxes per scale).
Now, the question is: how are these nine anchors assigned to the three grids? The assignment depends on the size of the anchor boxes, as follows:
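The three largest anchor boxes are assigned to the grid with the largest cells (13 × 13).
The three medium-sized anchor boxes are assigned to the intermediate grid (26 × 26).
Conversely, the three smallest anchor boxes are allocated to the grid with the smallest cells (52 × 52).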
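The assignment rule above can be sketched in a few lines. The function `assign_anchors_to_grids` is a hypothetical helper written for this illustration, and the strides 32, 16, and 8 are an assumption derived from the grid sizes (416 / 32 = 13, 416 / 16 = 26, 416 / 8 = 52). The nine anchors in the example are the values commonly quoted for YOLOv3 trained on COCO; anchors clustered from your own data will differ.

```python
def assign_anchors_to_grids(anchors, image_size=416, strides=(32, 16, 8)):
    """Split (width, height) anchors into equal groups, one group per grid.

    Returns a dict mapping grid size (cells per side) to its anchors.
    The largest anchors go to the coarsest grid (largest cells).
    """
    per_grid = len(anchors) // len(strides)
    # Order anchors by area, largest first.
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1], reverse=True)
    assignment = {}
    # Largest stride first, since a larger stride means larger cells.
    for i, stride in enumerate(sorted(strides, reverse=True)):
        grid_size = image_size // stride
        assignment[grid_size] = ordered[i * per_grid:(i + 1) * per_grid]
    return assignment


anchors = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]

for grid_size, group in assign_anchors_to_grids(anchors).items():
    print(f"{grid_size} x {grid_size} grid: {group}")
```

Sorting by area is one reasonable reading of "largest" and "smallest"; sorting by width or height alone could order some anchors differently.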
Metrics used in k-means clustering
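To determine the initial anchor boxes, we use k-means clustering, which groups the training bounding boxes by width and height, and therefore by size and aspect ratio. Instead of the Euclidean distance, the clustering uses a distance based on IoU scores, commonly d(box, centroid) = 1 − IoU(box, centroid), so that large boxes don't contribute larger errors than small ones simply because of their size. The aim is to maximize the IoU between the anchors and the ground truth boxes for more precise predictions. The centroids of these clusters then become our initial anchor boxes, giving us a useful starting point for object detection. ...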
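The clustering step can be sketched with NumPy as follows. This is a minimal illustration, not the exact procedure used by any particular YOLO release. The names `wh_iou` and `kmeans_anchors` are hypothetical, the boxes are assumed to be (width, height) pairs compared as if they shared a corner, and centroids are updated with the cluster mean (some implementations use the median instead).

```python
import numpy as np

def wh_iou(boxes, centroids):
    """IoU between (N, 2) boxes and (K, 2) centroids, using width/height only.

    Boxes are treated as if aligned at a shared corner, so the
    intersection is min(widths) * min(heights).
    """
    inter = (np.minimum(boxes[:, None, 0], centroids[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centroids[None, :, 1]))
    area_boxes = (boxes[:, 0] * boxes[:, 1])[:, None]
    area_centroids = (centroids[:, 0] * centroids[:, 1])[None, :]
    return inter / (area_boxes + area_centroids - inter)

def kmeans_anchors(boxes, k, iters=100, seed=0):
    """Cluster (width, height) pairs with the distance d = 1 - IoU."""
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct training boxes.
    centroids = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # The smallest 1 - IoU is the same as the largest IoU.
        labels = np.argmax(wh_iou(boxes, centroids), axis=1)
        new_centroids = np.array([
            boxes[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Return anchors sorted from smallest to largest area.
    return centroids[np.argsort(centroids[:, 0] * centroids[:, 1])]
```

A call such as `kmeans_anchors(train_wh, k=9)`, where `train_wh` is a hypothetical array of box widths and heights collected from the training labels, would return nine candidate anchors. The mean of `wh_iou(train_wh, anchors).max(axis=1)` is a convenient check of how well the anchors fit the data.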