YOLOv7 (2022)
Learn the novelties and the architecture of YOLOv7.
In this lesson, we will discuss the novelties of YOLOv7 in detail:
- E-ELAN
- Compound scaling for concatenation blocks
- Planned reparameterized convolution
To understand these structures better, we need to first examine some more basic components.
Group convolution
We learned about the network-in-network architecture, in which the input feature map passes through several different convolutions in a single layer, and the resulting outputs are concatenated to build one output feature map.
Suppose that we have different convolutions in a layer, but instead of applying each of them to the whole input feature map, we divide the input channels into groups, and each group goes to one specific convolution. Concatenating the outputs from each convolution, we obtain the output feature map. This process is called group convolution.
The above image visualizes the difference between network-in-network and group convolutions. In the left part, the whole input feature map passes through each convolution, and the output feature maps are concatenated to build the final output feature map. In the right part, the input feature map is divided into three groups: yellow, green, and blue. The yellow group passes through only a 1x1 convolution, the green group only through a 3x3 convolution, and the blue group only through a 5x5 convolution. The output feature maps are again concatenated to create the final feature map.
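To make the contrast concrete, here is a minimal PyTorch sketch of the grouped version from the figure: the input channels are split into three groups, each group passes through its own convolution (1x1, 3x3, and 5x5, as in the figure), and the outputs are concatenated. The channel counts are illustrative assumptions, not values from the lesson.

```python
import torch
import torch.nn as nn

class GroupedConvSketch(nn.Module):
    """Sketch of the figure's group convolution: split the channels into three
    groups, apply a separate convolution per group, then concatenate.
    Channel counts (12 in, 8 per group out) are illustrative assumptions."""
    def __init__(self, in_channels=12, out_per_group=8):
        super().__init__()
        g = in_channels // 3  # channels per group
        self.conv1 = nn.Conv2d(g, out_per_group, kernel_size=1)
        self.conv3 = nn.Conv2d(g, out_per_group, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(g, out_per_group, kernel_size=5, padding=2)

    def forward(self, x):
        yellow, green, blue = torch.chunk(x, 3, dim=1)  # split along channels
        return torch.cat(
            [self.conv1(yellow), self.conv3(green), self.conv5(blue)], dim=1
        )

x = torch.randn(1, 12, 32, 32)
print(GroupedConvSketch()(x).shape)  # torch.Size([1, 24, 32, 32])
```

When every group uses the same kernel size, PyTorch builds this split directly into `nn.Conv2d` via its `groups` argument, e.g. `nn.Conv2d(12, 24, kernel_size=3, padding=1, groups=3)`.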
In the above image, we see that the output feature map of the first group convolution is shuffled so that each group in the new feature map contains a part of each of the three original groups. These new groups, created by shuffling the channels, then pass through the second group convolution, where group one passes through only a 1x1 convolution, group two only through a 3x3 convolution, and group three only through a 5x5 convolution.
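The shuffle itself is just a reshape, a transpose of the group axes, and a reshape back. Below is a minimal sketch of the channel shuffle described above; the toy tensor at the end shows how channels from the three original groups interleave.

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    """Rearrange channels so each new group contains a part of every old group."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # separate the group axis
    x = x.transpose(1, 2).contiguous()        # interleave the groups
    return x.view(n, c, h, w)                 # flatten back to plain channels

# 6 channels in 3 groups [0,1 | 2,3 | 4,5] become [0,2,4 | 1,3,5]
x = torch.arange(6).float().view(1, 6, 1, 1)
print(channel_shuffle(x, 3).flatten().tolist())  # [0.0, 2.0, 4.0, 1.0, 3.0, 5.0]
```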
Aggregation vs. concatenation
We already discussed feature pyramids, residual blocks, and many other structures that concatenate or aggregate the feature maps derived from parallel paths to obtain the final feature map of a layer. But what is the exact difference between aggregation and concatenation in convolutional operations?
Concatenation refers to merging two or more feature maps along their channel dimension, whereas aggregation refers to merging them by applying a chosen multivariate function. In many cases, this aggregation function is a simple summation.
In the above image, we observe that both operations use the same symbol. This is common, so when we examine a structure, it's best to check the accompanying explanation to be sure whether the symbol means summation or concatenation. The left part of the image visualizes concatenation, where two feature maps are merged along their channel dimension, and the right part shows aggregation by summation, where the feature maps are added element-wise. For such an element-wise sum, the two tensors must have the same shape.
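A short sketch makes the difference explicit; the tensor shapes are illustrative assumptions:

```python
import torch

a = torch.randn(1, 8, 32, 32)
b = torch.randn(1, 8, 32, 32)

# Concatenation: merge along the channel dimension; the channel count grows.
cat = torch.cat([a, b], dim=1)
print(cat.shape)  # torch.Size([1, 16, 32, 32])

# Aggregation by summation: element-wise sum; shapes must match and are preserved.
agg = a + b
print(agg.shape)  # torch.Size([1, 8, 32, 32])
```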
Extended ELAN (E-ELAN)
Extended ELAN (E-ELAN), the backbone of YOLOv7, is a structure that uses group convolutions, channel shuffle, and cross-stage partial structures to design an efficient network. By controlling the shortest and the longest gradient paths, it allows deeper networks to converge and learn effectively.
First, let’s take a look at the Efficient Layer Aggregation Network (ELAN) architecture, which is the basis of E-ELAN.
In the above image, the input channels of each convolution layer are shown in blue and the output channels in red. We see that an input feature map with a certain number of channels is sent through a cross-stage partial network, where two parts are sent directly, and one part passes through 3x3 convolutions, entering and outputting as ...
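Since the description above is cut short, the following is only a rough sketch of an ELAN-style block under stated assumptions: two 1x1 convolutions split the input into a shortcut path and a computation path, the computation path runs through stacked 3x3 convolutions whose intermediate outputs are kept, and everything is concatenated and fused by a final 1x1 convolution. The channel counts, stack depth, and activation are assumptions for illustration, not YOLOv7's exact configuration.

```python
import torch
import torch.nn as nn

class ELANBlockSketch(nn.Module):
    """Illustrative ELAN-style block (assumed configuration, not YOLOv7's exact one)."""
    def __init__(self, c_in=64, c_hidden=32):
        super().__init__()
        self.shortcut = nn.Conv2d(c_in, c_hidden, 1)  # part sent directly
        self.entry = nn.Conv2d(c_in, c_hidden, 1)     # part entering the 3x3 stack
        self.stage1 = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
        )
        self.stage2 = nn.Sequential(
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(c_hidden, c_hidden, 3, padding=1), nn.SiLU(),
        )
        self.fuse = nn.Conv2d(4 * c_hidden, c_in, 1)  # merge all kept paths

    def forward(self, x):
        s = self.shortcut(x)
        e = self.entry(x)
        b1 = self.stage1(e)
        b2 = self.stage2(b1)
        # Concatenating the shortcut, entry, and intermediate outputs preserves
        # both short and long gradient paths through the block.
        return self.fuse(torch.cat([s, e, b1, b2], dim=1))

x = torch.randn(1, 64, 32, 32)
print(ELANBlockSketch()(x).shape)  # torch.Size([1, 64, 32, 32])
```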