How does YOLO loss work?

Object detection is a core task in computer vision, and deep learning models have become the go-to approach for it because of their performance. You Only Look Once (YOLO) is one such technique: it divides the image into grid cells and detects the objects in each cell. Another popular approach first predicts regions of interest in the image and then detects objects in those regions. Each technique requires a different loss function. In this Answer, we will focus on the loss function for YOLO.

YOLO loss function

The following equation is for the YOLO loss function:
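As a reference sketch, the loss can be written in the form given in the original YOLO (v1) paper. The grouping into parts $(1)$ to $(4)$ is an assumption chosen to match the numbering used in this Answer, and the symbols $q_i$ (confidence) and $k$ (class index) are renamed here so that they do not clash with $C$, the number of classes. Hats denote predicted values, $\mathbb{1}_{ij}^{\text{obj}}$ is $1$ when box $j$ in cell $i$ is responsible for an object (and $\mathbb{1}_{ij}^{\text{noobj}}$ is its complement), and $\mathbb{1}_{i}^{\text{obj}}$ is $1$ when cell $i$ contains an object.

$$
\begin{aligned}
\mathcal{L} ={} & \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] && (1) \\
& + \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] && (2) \\
& + \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} (q_i - \hat{q}_i)^2 + \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{noobj}} (q_i - \hat{q}_i)^2 && (3) \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}} \sum_{k=1}^{C} \left( p_i(k) - \hat{p}_i(k) \right)^2 && (4)
\end{aligned}
$$

The square roots in part $(2)$ come from the original paper: they make an error of a given size count more for small boxes than for large ones.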

This equation looks abstract, so we will break it into four parts (as numbered). Before diving into the mathematics, however, let's build the necessary intuition.

Understanding the terms

The YOLO architecture divides an image into $S \times S$ grid cells and predicts $B$ bounding boxes and $C$ class probabilities for each cell. Along with each bounding box, the model also predicts a confidence score. This means that each bounding box comes with $5$ values ($x$, $y$, width, height, and confidence). Putting all of them together gives us a prediction of size $S \times S \times (B \times 5 + C)$. In the following figure, we assume that the image is divided into $3 \times 3$ grid cells, each cell gives us $2$ bounding box predictions ($B=2$), and we have $4$ classes ($C=4$). The final output therefore has size $3 \times 3 \times 14$.
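The size formula above can be checked with a few lines of Python. This is a minimal sketch; the function names `yolo_output_shape` and `yolo_output_size` are hypothetical helpers introduced only for illustration.

```python
def yolo_output_shape(S, B, C):
    """Return the (rows, cols, depth) shape of a YOLO-style output grid."""
    # Each box contributes 5 values (x, y, width, height, confidence);
    # the C class probabilities are predicted once per cell.
    depth = B * 5 + C
    return (S, S, depth)


def yolo_output_size(S, B, C):
    """Return the total number of predicted values, S * S * (B * 5 + C)."""
    rows, cols, depth = yolo_output_shape(S, B, C)
    return rows * cols * depth


# The example from the text: a 3 x 3 grid, 2 boxes per cell, 4 classes.
print(yolo_output_shape(S=3, B=2, C=4))  # (3, 3, 14)
print(yolo_output_size(S=3, B=2, C=4))   # 126
```

With $S=3$, $B=2$, and $C=4$, every cell holds $2 \times 5 + 4 = 14$ values, so the whole prediction contains $3 \times 3 \times 14 = 126$ numbers.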

Putting it all together

Now, let's go back to the loss function. The loss function is the sum of:

Localization loss: This is represented by equations $(1)$ and $(2)$. For each box responsible for an object, it calculates the differences between the actual and predicted $(x,y)$ coordinates, and between the actual and predicted width and height.
Objectness loss: This is represented by equation $(3)$. For each box, this computes the loss on whether the box contains any object by taking the difference between the actual and predicted confidence scores.
Classification loss: This is represented by equation $(4)$. For each grid cell that contains an object, it calculates the difference between the actual and predicted class probabilities.

The $\lambda$ terms in the equations are constant weights that control how strongly each part of the loss is penalized (i.e., more or less).
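The three components and their $\lambda$ weights can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: the dictionary layout, the function name `yolo_loss`, and the `responsible` index are hypothetical; the target confidence is simplified to $1$ for the responsible box and $0$ otherwise (the original paper uses the IoU with the ground-truth box); and the weights $5$ and $0.5$, as well as the square roots on width and height, follow the original YOLO paper.

```python
import math

LAMBDA_COORD = 5.0  # weight on the localization terms (value from the YOLO v1 paper)
LAMBDA_NOOBJ = 0.5  # weight on confidence of boxes without an object (YOLO v1 paper)


def yolo_loss(pred_cells, true_cells):
    """Sum-of-squared-errors loss over a list of grid cells.

    pred_cells: [{"boxes": [(x, y, w, h, conf), ...], "classes": [p_1, ...]}, ...]
    true_cells: [{"has_object": bool, "box": (x, y, w, h),
                  "responsible": int, "classes": [one-hot]}, ...]
    Widths and heights are assumed to be non-negative.
    """
    loc = obj = noobj = cls = 0.0
    for pred, true in zip(pred_cells, true_cells):
        for j, (x, y, w, h, conf) in enumerate(pred["boxes"]):
            if true["has_object"] and j == true["responsible"]:
                tx, ty, tw, th = true["box"]
                # Localization: centre coordinates, then square-rooted size.
                loc += (x - tx) ** 2 + (y - ty) ** 2
                loc += (math.sqrt(w) - math.sqrt(tw)) ** 2
                loc += (math.sqrt(h) - math.sqrt(th)) ** 2
                # Objectness for the responsible box (assumed target: 1).
                obj += (conf - 1.0) ** 2
            else:
                # Objectness for every other box (target: 0).
                noobj += conf ** 2
        if true["has_object"]:
            # Classification: once per cell that contains an object.
            cls += sum((p - t) ** 2
                       for p, t in zip(pred["classes"], true["classes"]))
    localization = LAMBDA_COORD * loc
    objectness = obj + LAMBDA_NOOBJ * noobj
    return {"localization": localization, "objectness": objectness,
            "classification": cls,
            "total": localization + objectness + cls}


# Two cells, B = 2 boxes, C = 4 classes; only the first cell holds an object.
pred = [
    {"boxes": [(0.5, 0.5, 0.25, 0.25, 0.9), (0.1, 0.1, 0.04, 0.04, 0.2)],
     "classes": [0.7, 0.1, 0.1, 0.1]},
    {"boxes": [(0.0, 0.0, 0.0, 0.0, 0.1), (0.0, 0.0, 0.0, 0.0, 0.3)],
     "classes": [0.25, 0.25, 0.25, 0.25]},
]
true = [
    {"has_object": True, "box": (0.5, 0.5, 0.25, 0.25),
     "responsible": 0, "classes": [1, 0, 0, 0]},
    {"has_object": False},
]
print(yolo_loss(pred, true))
```

In this toy example the responsible box matches the ground truth exactly, so the localization part is zero and the total comes only from the objectness and classification parts. Raising `LAMBDA_COORD` or lowering `LAMBDA_NOOBJ` shifts how much each kind of error contributes to the total.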

Conclusion

In summary, the YOLO loss function can be broken down into the localization, objectness, and classification losses. Each one measures a different kind of error, and their weighted sum is the final YOLO loss.
