YOLOv7 (2022)
Learn the novelties and the architecture of YOLOv7.
In this lesson, we will discuss the novelties of YOLOv7 in detail:
- E-ELAN
- Compound scaling for concatenation blocks
- Planned reparameterized convolution
To understand these structures better, we need to first examine some more basic components.
Group convolution
We learned about the network-in-network architecture, in which the input feature map passes through several different convolutions in a single layer, and the resulting outputs are concatenated to build one output feature map.
Suppose that we have different convolutions in a layer, but instead of applying each of them to the whole input feature map, we divide the input channels into groups, and each group goes to one specific convolution. Concatenating the outputs from each convolution, we obtain the output feature map. This process is called group convolution.
The above image visualizes the difference between network-in-network and group convolutions. In the left part, the whole input feature map passes through each convolution, and the output feature maps are concatenated to build the final output feature map. In the right part, the input feature map is divided into three groups: yellow, green, and blue. The yellow group passes through only a 1x1 convolution, the green group only through a 3x3 convolution, and the blue group only through a 5x5 convolution. The output feature maps are again concatenated to create the final feature map.
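To make the contrast concrete, here is a minimal PyTorch sketch of the grouped version from the figure: the input channels are split into three groups, each group passes through its own convolution (1x1, 3x3, and 5x5, as in the figure), and the outputs are concatenated. The channel counts are illustrative assumptions, not values from the lesson.

```python
import torch
import torch.nn as nn

class GroupedConvSketch(nn.Module):
    """Sketch of the figure's group convolution: split the channels into three
    groups, apply a separate convolution per group, then concatenate.
    Channel counts (12 in, 8 per group out) are illustrative assumptions."""
    def __init__(self, in_channels=12, out_per_group=8):
        super().__init__()
        g = in_channels // 3  # channels per group
        self.conv1 = nn.Conv2d(g, out_per_group, kernel_size=1)
        self.conv3 = nn.Conv2d(g, out_per_group, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(g, out_per_group, kernel_size=5, padding=2)

    def forward(self, x):
        yellow, green, blue = torch.chunk(x, 3, dim=1)  # split along channels
        return torch.cat(
            [self.conv1(yellow), self.conv3(green), self.conv5(blue)], dim=1
        )

x = torch.randn(1, 12, 32, 32)
print(GroupedConvSketch()(x).shape)  # torch.Size([1, 24, 32, 32])
```

When every group uses the same kernel size, PyTorch builds this split directly into `nn.Conv2d` via its `groups` argument, e.g. `nn.Conv2d(12, 24, kernel_size=3, padding=1, groups=3)`.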
In the above image, we see that the output feature map of the first group convolution is shuffled so that each group in the new feature map contains a part of each of the three original groups. These new groups, created by shuffling the channels, then pass through the second group convolution, where group one passes through only a 1x1 convolution, group two only through a 3x3 convolution, and group three only through a 5x5 convolution.
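The shuffle itself is just a reshape, a transpose of the group axes, and a reshape back. Below is a minimal sketch of the channel shuffle described above; the toy tensor at the end shows how channels from the three original groups interleave.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Rearrange channels so each new group contains a part of every old group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # separate the group axis
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)                 # flatten back to plain channels

# 6 channels in 3 groups [0,1 | 2,3 | 4,5] become [0,2,4 | 1,3,5]
x = torch.arange(6).float().view(1, 6, 1, 1)
print(channel_shuffle(x, 3).flatten().tolist())  # [0.0, 2.0, 4.0, 1.0, 3.0, 5.0]
```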
Aggregation vs. concatenation
We already discussed feature pyramids, residual blocks, and many other structures that concatenate or aggregate the feature maps derived from parallel paths to obtain the final feature map of a layer. But what is the exact difference between aggregation and concatenation in convolutional operations?
Concatenation refers to merging two or more feature maps along their channel dimension, whereas aggregation refers to merging them by applying a chosen multivariate function. In many cases, this aggregation function is a simple summation.
In the above image, we observe that both operations use the same symbol. This is common, so when we examine a structure, it's best to check the accompanying explanation to be sure whether the symbol means summation or concatenation. The left part of the image visualizes concatenation, where two feature maps are merged along their channel dimension, and the right part shows aggregation by summation, where the feature maps are added element-wise. For such an element-wise sum, the two tensors must have the same shape.
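A short sketch makes the difference explicit; the tensor shapes are illustrative assumptions:

```python
import torch

a = torch.randn(1, 8, 32, 32)
b = torch.randn(1, 8, 32, 32)

# Concatenation: merge along the channel dimension; the channel count grows.
cat = torch.cat([a, b], dim=1)
print(cat.shape)  # torch.Size([1, 16, 32, 32])

# Aggregation by summation: element-wise sum; shapes must match and are preserved.
agg = a + b
print(agg.shape)  # torch.Size([1, 8, 32, 32])
```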
Extended ELAN (E-ELAN)
Extended ELAN (E-ELAN), the backbone of YOLOv7, is a structure that uses group convolutions, channel shuffle, and cross-stage partial structures to design an efficient network. By controlling the shortest and the longest gradient paths, it allows deeper networks to converge and learn effectively.
First, let’s take a look at the Efficient Layer Aggregation Network (ELAN) architecture, which is the basis of E-ELAN.
In the above image, the input channels of each convolution layer are shown in blue and the output channels in red. We see that an input feature map with a certain number of channels is sent through a cross-stage partial network, where two parts are sent directly, and one part passes through 3x3 convolutions, entering and outputting as ...
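Since the description above is cut short, the following is only a rough sketch of an ELAN-style block under stated assumptions: two 1x1 convolutions split the input into a shortcut path and a computation path, the computation path runs through stacked 3x3 convolutions whose intermediate outputs are kept, and everything is concatenated and fused by a final 1x1 convolution. The channel counts, stack depth, and activation are assumptions for illustration, not YOLOv7's exact configuration.

```python
import torch
import torch.nn as nn

class ELANBlockSketch(nn.Module):
    """Illustrative ELAN-style block (assumed configuration, not YOLOv7's exact one)."""
    def __init__(self, c_in=64, c_hidden=32):
        super().__init__()
        self.shortcut = nn.Conv2d(c_in, c_hidden, 1)  # part sent directly
        self.entry = nn.Conv2d(c_in, c_hidden, 1)     # part entering the 3x3 stack
        self.stage1 = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
        )
        self.stage2 = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
        )
        self.fuse = nn.Conv2d(4 * c_hidden, c_in, 1)  # merge all kept paths

    def forward(self, x):
        s = self.shortcut(x)
        e = self.entry(x)
        b1 = self.stage1(e)
        b2 = self.stage2(b1)
        # Concatenating the shortcut, entry, and intermediate outputs preserves
        # both short and long gradient paths through the block.
        return self.fuse(torch.cat([s, e, b1, b2], dim=1))

x = torch.randn(1, 64, 32, 32)
print(ELANBlockSketch()(x).shape)  # torch.Size([1, 64, 32, 32])
```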