
Basics of the Convolutional Neural Network (CNN): Part I


Learn about the fundamentals of a CNN, a building block for YOLO.

A convolutional neural network (CNN) tries to detect multiple patterns in different regions of an image using a receptive field, which is the area that a neuron sees when processing data. Here are some key features of a CNN:

  • A CNN is a special type of neural network—generally used for image data—that can extract features from an image so that the computer can identify its content.

  • The intuition behind a CNN is to reduce the spatial size of the input while increasing the depth (the number of channels) as data flows through the network.

  • A CNN uses convolution instead of general matrix multiplication.

  • Instead of feeding raw pixels to a neural network, we feed extracted features to a CNN's later layers.
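The convolution operation mentioned above can be sketched in plain NumPy. This is a minimal, illustrative implementation (valid padding, stride 1, single channel); the edge-detector kernel values are an assumption for demonstration, not part of the original text:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a 2D image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Element-wise multiply the window by the kernel and sum:
            # this is convolution in place of general matrix multiplication.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1., -1.],
                        [1., -1.]])  # hypothetical vertical-edge detector
features = conv2d(image, edge_kernel)
print(features.shape)  # (3, 3): a 4x4 image shrinks under a 2x2 kernel
```

Note that the same small kernel is reused at every position, which is the weight sharing discussed later in this lesson.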

Why are CNNs needed over ANNs for images?

CNNs are preferred over artificial neural networks (ANNs) for image processing tasks because they can effectively capture spatial hierarchies and local patterns in images, thanks to their convolutional layers and weight-sharing architecture. (An ANN is a computational model inspired by the structure of biological neural networks; it recognizes patterns and makes decisions by processing input data through layers of interconnected nodes, or “neurons.”) This enables CNNs to learn more complex features while requiring fewer parameters, making them more efficient and accurate in handling image data.

Problems with neural networks

Let’s learn in detail why CNNs perform better than ANNs for image data.

Lack of rotation/position invariance

Figure: Conversion of an RGB image to a 1D array for a neural network input

Images are just pixel values to a computer. A neural network (NN) flattens the image representation, converting it into a 1D array. The issue with this approach is that the spatial information of the image is lost in the conversion. So, if we train a model on data showing a person at a specific location, an NN will fail to identify the same person at another location in the image.
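The loss of spatial information from flattening can be seen directly. In this sketch (the image values are random placeholders), two pixels that are vertical neighbours in the image end up far apart in the flat vector:

```python
import numpy as np

# A tiny 8x8 RGB image; an NN flattens it into a 1D vector.
rgb = np.random.rand(8, 8, 3)
flat = rgb.flatten()
print(flat.shape)  # (192,)

# Pixel (0, 0) and its vertical neighbour (1, 0) are adjacent in the
# image, but in row-major flattening their red channels land at
# indices 0 and 1*8*3 = 24 -- the adjacency is no longer visible
# to a fully connected layer.
print(flat[24] == rgb[1, 0, 0])  # True
```

Because the network only sees the flat vector, it has no built-in notion that index 0 and index 24 came from neighbouring pixels.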

Figure: Location invariance in an NN

Not scalable

Neural networks follow a fully connected structure, meaning each input is connected to all the neurons in the next layer. This significantly increases the number of weights to be trained. For instance, let’s suppose we have an RGB image with dimensions of 8 × 8 pixels. In an NN, each pixel serves as an input to the network, resulting in a total of 8 × 8 × 3 = 192 inputs and weights for our network.

But if we want to take a larger image of 150 × 150 pixels for our model, the total number of inputs now will be 150 × 150 × 3 = 67,500, which is immense. These numbers add up as we increase the number of layers in our NN, making our network complex, difficult to train, and prone to overfitting. As a result, it becomes unsuitable for larger images.
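The arithmetic above can be checked with a short helper (the 1,000-neuron hidden layer in the last line is an illustrative assumption, not a figure from the text):

```python
def fc_inputs(height, width, channels):
    """Number of inputs (and first-layer weights per neuron)
    for a fully connected network on an RGB image."""
    return height * width * channels

print(fc_inputs(8, 8, 3))      # 192
print(fc_inputs(150, 150, 3))  # 67500

# Each neuron in the first hidden layer needs one weight per input,
# so a hypothetical 1,000-neuron layer already requires 67.5M weights:
print(fc_inputs(150, 150, 3) * 1000)  # 67500000
```

This is why the fully connected approach stops scaling: the weight count grows with the product of image area, channel count, and layer width.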

How do CNNs solve these problems?

CNNs use shared weights to control the number of parameters. The intuition behind this is that if a filter is useful for finding features at coordinates (x_i, y_i), it should be able to compute features at (x_j, y_j) ...
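The weight-sharing idea can be sketched as follows: one small filter, with only nine weights, responds to the same pattern wherever it appears in the image. The vertical-bar pattern and kernel below are illustrative assumptions:

```python
import numpy as np

# One 3x3 filter shared across the whole image: 9 weights total,
# regardless of where in the image the pattern appears.
kernel = np.zeros((3, 3))
kernel[:, 1] = 1.0  # hypothetical vertical-bar detector

def responses(image, kernel):
    """Apply the same shared kernel at every valid position."""
    kh, kw = kernel.shape
    h, w = image.shape
    return np.array([[np.sum(image[y:y + kh, x:x + kw] * kernel)
                      for x in range(w - kw + 1)]
                     for y in range(h - kh + 1)])

left = np.zeros((6, 6))
left[1:4, 1] = 1.0   # vertical bar near the left edge
right = np.zeros((6, 6))
right[1:4, 4] = 1.0  # the same bar, shifted to the right

resp_left = responses(left, kernel)
resp_right = responses(right, kernel)
# The same 9 shared weights fire equally strongly on both bars.
print(resp_left.max(), resp_right.max())  # 3.0 3.0
```

A fully connected layer would need separate weights for each bar position; the shared filter detects the feature at (x_i, y_i) and (x_j, y_j) with the same parameters.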
