
Basics of the Convolutional Neural Network (CNN): Part I


Learn about the fundamentals of a CNN, a building block for YOLO.

A convolutional neural network (CNN) tries to detect multiple patterns in different regions of an image using a receptive field, which is the area that a neuron sees when processing data. Here are some key features of a CNN:

  • A CNN is a special type of neural network—generally used for image data—that can extract features from an image so that the computer can identify its content.

  • The intuition behind a CNN is to reduce the spatial size of the input while increasing the depth (the number of channels) as data flows through the network.

  • A CNN uses convolution instead of general matrix multiplication.

  • Instead of feeding raw pixels to a neural network, we feed extracted features to a CNN's later layers.
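The convolution operation mentioned above can be sketched in plain NumPy. This is a minimal, illustrative implementation (valid padding, stride 1, single channel); the edge-detector kernel values are an assumption for demonstration, not part of the original text:

```python
import numpy as np

def conv2d(image, kernel):
    """Slide a kernel over a 2D image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            # Element-wise multiply the window by the kernel and sum:
            # this is convolution in place of general matrix multiplication.
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1., -1.],
                        [1., -1.]])  # hypothetical vertical-edge detector
features = conv2d(image, edge_kernel)
print(features.shape)  # (3, 3): a 4x4 image shrinks under a 2x2 kernel
```

Note that the same small kernel is reused at every position, which is the weight sharing discussed later in this lesson.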

Why are CNNs needed over ANNs for images?

CNNs are preferred over artificial neural networks (ANNs) for image processing tasks because they can effectively capture spatial hierarchies and local patterns in images, thanks to their convolutional layers and weight-sharing architecture. (An ANN is a computational model inspired by the structure of biological neural networks; it recognizes patterns and makes decisions by processing input data through layers of interconnected nodes, or “neurons.”) This enables CNNs to learn more complex features while requiring fewer parameters, making them more efficient and accurate in handling image data.

Problems with neural networks

Let’s learn in detail why CNNs perform better than ANNs for image data.

Lack of rotation/position invariance

Figure: Conversion of an RGB image to a 1D array for a neural network input

Images are just pixel values to a computer. A neural network (NN) flattens the image representation, converting it into a 1D array. The issue with this approach is that the spatial information of the image is lost in the conversion. So, if we train a model on data showing a person at a specific location, an NN will fail to identify the same person at another location in the image.
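The loss of spatial information from flattening can be seen directly. In this sketch (the image values are random placeholders), two pixels that are vertical neighbours in the image end up far apart in the flat vector:

```python
import numpy as np

# A tiny 8x8 RGB image; an NN flattens it into a 1D vector.
rgb = np.random.rand(8, 8, 3)
flat = rgb.flatten()
print(flat.shape)  # (192,)

# Pixel (0, 0) and its vertical neighbour (1, 0) are adjacent in the
# image, but in row-major flattening their red channels land at
# indices 0 and 1*8*3 = 24 -- the adjacency is no longer visible
# to a fully connected layer.
print(flat[24] == rgb[1, 0, 0])  # True
```

Because the network only sees the flat vector, it has no built-in notion that index 0 and index 24 came from neighbouring pixels.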

Figure: Location invariance in an NN

Not scalable

Neural networks follow a fully connected structure, meaning each input is connected to all the neurons in the next layer. This significantly increases the number of weights to be trained. For instance, let’s suppose we have an RGB image with dimensions of 8 × 8 pixels. In an NN, each pixel serves as an input to the network, resulting in a total of 8 × 8 × 3 = 192 inputs and weights for our network.

But if we want to take a larger image of 150 × 150 pixels for our model, the total number of inputs now will be 150 × 150 × 3 = 67,500, which is immense. These numbers add up as we increase the number of layers in our NN, making our network complex, difficult to train, and prone to overfitting. As a result, it becomes unsuitable for larger images.
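The arithmetic above can be checked with a short helper (the 1,000-neuron hidden layer in the last line is an illustrative assumption, not a figure from the text):

```python
def fc_inputs(height, width, channels):
    """Number of inputs (and first-layer weights per neuron)
    for a fully connected network on an RGB image."""
    return height * width * channels

print(fc_inputs(8, 8, 3))      # 192
print(fc_inputs(150, 150, 3))  # 67500

# Each neuron in the first hidden layer needs one weight per input,
# so a hypothetical 1,000-neuron layer already requires 67.5M weights:
print(fc_inputs(150, 150, 3) * 1000)  # 67500000
```

This is why the fully connected approach stops scaling: the weight count grows with the product of image area, channel count, and layer width.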

How do CNNs solve these problems?

CNNs use shared weights to control the number of parameters. The intuition behind this is that if a filter is useful for finding features at coordinates (x_i, y_i), it should be able to compute features at (x_j, y_j) ...
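The weight-sharing idea can be sketched as follows: one small filter, with only nine weights, responds to the same pattern wherever it appears in the image. The vertical-bar pattern and kernel below are illustrative assumptions:

```python
import numpy as np

# One 3x3 filter shared across the whole image: 9 weights total,
# regardless of where in the image the pattern appears.
kernel = np.zeros((3, 3))
kernel[:, 1] = 1.0  # hypothetical vertical-bar detector

def responses(image, kernel):
    """Apply the same shared kernel at every valid position."""
    kh, kw = kernel.shape
    h, w = image.shape
    return np.array([[np.sum(image[y:y + kh, x:x + kw] * kernel)
                      for x in range(w - kw + 1)]
                     for y in range(h - kh + 1)])

left = np.zeros((6, 6))
left[1:4, 1] = 1.0   # vertical bar near the left edge
right = np.zeros((6, 6))
right[1:4, 4] = 1.0  # the same bar, shifted to the right

resp_left = responses(left, kernel)
resp_right = responses(right, kernel)
# The same 9 shared weights fire equally strongly on both bars.
print(resp_left.max(), resp_right.max())  # 3.0 3.0
```

A fully connected layer would need separate weights for each bar position; the shared filter detects the feature at (x_i, y_i) and (x_j, y_j) with the same parameters.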
