How to do backpropagation in a neural network

In this Answer, we’ll see how we can backpropagate the loss in a neural network and update the parameters to make the model learn from the data. Consider the following neural network with a cross-entropy loss function:

[Network diagram: layer 1 holds the inputs x1 (node 0) and x2 (node 1); layer 2 holds nodes 2 and 3; layer 3 holds nodes 4 and 5; layer 4 holds nodes 6 and 7; layer 5 holds the outputs o1 (node 8) and o2 (node 9). Every node is connected to every node in the next layer.]
A multilayered perceptron with three hidden layers and an output layer of size two.

Note that each of layers 2–5 has a weight matrix, and each node in these layers has an associated bias. The output of each of these layers is computed by applying the sigmoid to the forward propagation equation:

Z = W^\text{T} \times X + b

out = \sigma(Z)

Where,

  • $W$ is the weight matrix of the layer
  • $X$ is the input from the previous layer
  • $b$ is the bias vector of the layer
  • $\sigma$ is the sigmoid activation function
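
As a quick illustration, here is a minimal NumPy sketch of this forward pass for a single layer; the weight, bias, and input values are made up purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(W, X, b):
    """One layer's output: out = sigmoid(W^T X + b)."""
    Z = W.T @ X + b      # net input of the layer
    return sigmoid(Z)    # activated output passed to the next layer

# Example: layer 2 of the network above (2 inputs -> 2 nodes).
X = np.array([0.5, 0.1])            # outputs of layer 1 (x1, x2)
W = np.array([[0.15, 0.25],         # rows index inputs, columns index nodes
              [0.20, 0.30]])
b = np.array([0.35, 0.35])          # one bias per node
print(forward_layer(W, X, b))
```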

The binary cross-entropy loss is given as follows:

Cost(y_i, p_i) = - \frac{1}{2N} \sum_{1}^{N} \sum_{i=1}^{2} \left( y_i \log(p_i) + (1-y_i)\log(1-p_i) \right)

Where,

  • $y$ is the one-hot encoded actual class
  • $p$ is the one-hot encoded predicted class
  • $N$ is the sample size
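
For illustration, the following NumPy snippet computes this cost for a small batch; the values of `Y` and `P` below are made up for demonstration:

```python
import numpy as np

def binary_cross_entropy(Y, P):
    """Cost = -(1 / 2N) * sum over the batch and the 2 outputs of
    y*log(p) + (1 - y)*log(1 - p), matching the formula above."""
    N = Y.shape[0]                      # sample (batch) size
    eps = 1e-12                         # guard against log(0)
    P = np.clip(P, eps, 1 - eps)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)) / (2 * N)

Y = np.array([[1, 0],                   # one-hot actual classes
              [0, 1]])
P = np.array([[0.8, 0.3],               # predicted probabilities (outputs o1, o2)
              [0.4, 0.9]])
print(binary_cross_entropy(Y, P))
```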

Parameter updates

We need to perform parameter updates using the following equation:

w_{new} = w_{old} - \alpha \times \Delta w

and,

\Delta w = \frac{\partial Cost}{\partial w}

Where,

  • $w$ is the parameter
  • $\alpha$ is the learning rate
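
The update rule itself is a one-liner; in the sketch below, the weight matrix, gradient, and learning rate are illustrative values only:

```python
import numpy as np

def update(w_old, delta_w, alpha=0.1):
    """Gradient-descent step: w_new = w_old - alpha * delta_w."""
    return w_old - alpha * delta_w

W  = np.array([[0.40, 0.45],            # current weights (illustrative)
               [0.50, 0.55]])
dW = np.array([[0.08, 0.01],            # dCost/dW from backpropagation (illustrative)
               [0.09, 0.02]])
print(update(W, dW, alpha=0.5))
```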

Therefore, we will backpropagate each parameter’s share of the cost/loss and update the parameter by a fraction determined by the learning rate.

Last layer

Let’s obtain the derivatives ($\Delta w$) for layer 5.

For the last layer:

W_5 = \begin{bmatrix} w_{68} & w_{69} \\ w_{78} & w_{79} \end{bmatrix}

\frac{\partial Cost_{total}}{\partial w_{68}} = \frac{\partial Cost_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial w_{68}}

where,

\frac{\partial net_{o8}}{\partial w_{68}} = out_{h6},

and,

\frac{\partial out_{o8}}{\partial net_{o8}} = out_{o8}(1-out_{o8}),

because,

Cost_{total} = - \frac{1}{2N} \sum^{\text{for batch}} \Big( \big( y_1 \log(out_{o8}) + (1-y_1)\log(1-out_{o8}) \big) + \big( y_2 \log(out_{o9}) + (1-y_2)\log(1-out_{o9}) \big) \Big),

we have,

\begin{align*} \frac{\partial Cost_{total}}{\partial out_{o8}} &= - \frac{1}{2N} \sum^{\text{for batch}} \left(\frac{y_1}{out_{o8}} - \frac{1-y_1}{1-out_{o8}}\right) \\ &= \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}}\right) \end{align*}.

Therefore, we get the equation for computing $\Delta w_{68}$:

\frac{\partial Cost_{total}}{\partial w_{68}} = \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}}\right) \times out_{o8}(1-out_{o8}) \times out_{h6}.

Likewise, for $\Delta b_8$, because:

\frac{\partial net_{o8}}{\partial b_{8}} = 1,

we get,

\begin{align*} \frac{\partial Cost_{total}}{\partial b_{8}} &= \frac{\partial Cost_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial b_{8}} \\ &= \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}}\right) \times out_{o8}(1-out_{o8}) \end{align*}.

This common factor is referred to as $\delta$ (delta); since $\Delta b_8$ equals it exactly, we will refer to $\Delta b_8$ as $\delta_8$.

Similarly, we can find $\Delta w_{69}$ using the equation below:

\frac{\partial Cost_{total}}{\partial w_{69}} = \frac{\partial Cost_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial w_{69}}

where,

\begin{align*} \frac{\partial Cost_{total}}{\partial out_{o9}} &= - \frac{1}{2N} \sum^{\text{for batch}} \left(\frac{y_2}{out_{o9}} - \frac{1-y_2}{1-out_{o9}}\right) \\ &= \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}}\right) \end{align*},

and,

\frac{\partial out_{o9}}{\partial net_{o9}} = out_{o9}(1-out_{o9}),

and,

\frac{\partial net_{o9}}{\partial w_{69}} = out_{h6},

so $\Delta w_{69}$ becomes:

\frac{\partial Cost_{total}}{\partial w_{69}} = \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}}\right) \times out_{o9}(1-out_{o9}) \times out_{h6}.

Likewise, for $\Delta b_9$:

\frac{\partial Cost_{total}}{\partial b_{9}} = \frac{\partial Cost_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial b_{9}},

where,

\frac{\partial net_{o9}}{\partial b_{9}} = 1,

so,

\frac{\partial Cost_{total}}{\partial b_{9}} = \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}}\right) \times out_{o9}(1-out_{o9}).

Similarly, we will refer to $\Delta b_9$ as $\delta_9$.
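
To make the last-layer computation concrete, here is a small NumPy sketch that computes $\delta_8$ and $\delta_9$ and the corresponding weight and bias gradients for a single sample (so the batch sum and the $\frac{1}{2N}$ factor are dropped); all numeric values are illustrative:

```python
import numpy as np

out_h = np.array([0.6, 0.7])      # out_h6, out_h7: activations of layer 4
out_o = np.array([0.75, 0.55])    # out_o8, out_o9: activations of layer 5
y     = np.array([1.0, 0.0])      # one-hot actual class (y1, y2)

# dCost/dout and the sigmoid derivative dout/dnet for each output node
dcost_dout = -y / out_o + (1 - y) / (1 - out_o)
dout_dnet  = out_o * (1 - out_o)
delta      = dcost_dout * dout_dnet          # [delta_8, delta_9]

# Last-layer gradients: dCost/dw_ij = delta_j * out_h_i and dCost/db_j = delta_j
dW5 = np.outer(out_h, delta)                 # [[dw68, dw69], [dw78, dw79]]
db5 = delta                                  # [db8, db9]
print(dW5, db5)
```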

Hidden layers

To find the gradient updates for a node’s weights, we need to take into account that node’s contribution to the error. For node 6 of the second-to-last layer, this contribution is:

\frac{\partial Cost_{total}}{\partial out_{h6}} = \frac{\partial Cost_{o8}}{\partial out_{h6}} + \frac{\partial Cost_{o9}}{\partial out_{h6}},

where,

\begin{align*} \frac{\partial Cost_{o8}}{\partial out_{h6}} &= \frac{\partial Cost_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial out_{h6}} \\ &= \delta_8 \times w_{68} \end{align*},

and,

\begin{align*} \frac{\partial Cost_{o9}}{\partial out_{h6}} &= \frac{\partial Cost_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial out_{h6}} \\ &= \delta_9 \times w_{69} \end{align*}.

Therefore,

\frac{\partial Cost_{total}}{\partial out_{h6}} = \delta_8 \times w_{68} + \delta_9 \times w_{69}.
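
In code, this backpropagated contribution is just a matrix-vector product of the layer’s weights with the output deltas. The sketch below uses made-up values:

```python
import numpy as np

# W5 rows index the sending nodes (6, 7) and columns the receiving nodes (8, 9).
W5    = np.array([[0.40, 0.45],   # [[w68, w69],
                  [0.50, 0.55]])  #  [w78, w79]]
delta = np.array([0.05, -0.02])   # [delta_8, delta_9] from the last layer

# Row 0: dCost/dout_h6 = delta_8*w68 + delta_9*w69; row 1 is the same for node 7.
dcost_dout_h = W5 @ delta
print(dcost_dout_h)
```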

To compute $\Delta w_{46}$:

\frac{\partial Cost_{total}}{\partial w_{46}} = \frac{\partial Cost_{total}}{\partial out_{h6}} \times \frac{\partial out_{h6}}{\partial net_{h6}} \times \frac{\partial net_{h6}}{\partial w_{46}}
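
For a single sample, this chain rule reduces to a few multiplications; the numbers below are illustrative placeholders:

```python
dcost_dout_h6 = 0.011                    # delta_8*w68 + delta_9*w69, from the previous step
out_h6        = 0.62                     # activation of node 6 (layer 4)
out_h4        = 0.58                     # activation of node 4 (layer 3), the input on w46

dout_dnet_h6 = out_h6 * (1 - out_h6)     # sigmoid derivative at node 6
dw46 = dcost_dout_h6 * dout_dnet_h6 * out_h4
print(dw46)                              # dCost/dw46
```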

We can then compute the contribution of node 4 in the previous layer in the same way:

\begin{align*} \frac{\partial Cost_{total}}{\partial out_{h4}} &= \frac{\partial Cost_{total}}{\partial out_{h6}} \times \frac{\partial out_{h6}}{\partial net_{h6}} \times \frac{\partial net_{h6}}{\partial out_{h4}} + \frac{\partial Cost_{total}}{\partial out_{h7}} \times \frac{\partial out_{h7}}{\partial net_{h7}} \times \frac{\partial net_{h7}}{\partial out_{h4}} \\ &= (\delta_8 \times w_{68} + \delta_9 \times w_{69}) \times out_{h6}(1- out_{h6}) \times w_{46} \\ &+ (\delta_8 \times w_{78} + \delta_9 \times w_{79}) \times out_{h7}(1- out_{h7}) \times w_{47} \end{align*}.

The gradients for the remaining hidden layers are computed in the same way, working backward one layer at a time, as sketched below.
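
Putting the pieces together, the following sketch runs the whole backward pass for a network like the one above. The function, variable names, and values are assumptions for illustration only; `outs` holds the stored activations of layers 1–5 and `weights` holds $W_2$–$W_5$:

```python
import numpy as np

def backward(outs, weights, y):
    """outs: activation vectors of layers 1-5; weights: [W2, W3, W4, W5];
    y: one-hot target. Returns (dW, db) for each weighted layer, front to back."""
    grads = []
    out_o = outs[-1]
    # Error signal at the output layer (single sample, no 1/2N averaging).
    dcost_dout = -y / out_o + (1 - y) / (1 - out_o)
    for k in range(len(weights) - 1, -1, -1):
        out_k = outs[k + 1]                               # this layer's activations
        delta = dcost_dout * out_k * (1 - out_k)          # elementwise sigmoid derivative
        grads.append((np.outer(outs[k], delta), delta))   # (dCost/dW, dCost/db)
        dcost_dout = weights[k] @ delta                   # contribution passed one layer back
    return list(reversed(grads))

# Tiny usage example with made-up activations and weights (2 nodes per layer).
outs    = [np.array([0.5, 0.1])] + [np.array([0.6, 0.4]) for _ in range(4)]
weights = [np.full((2, 2), 0.3) for _ in range(4)]
y       = np.array([1.0, 0.0])
for dW, db in backward(outs, weights, y):
    print(dW, db)
```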
