How to do backpropagation in a neural network

In this Answer, we’ll see how we can backpropagate the loss in a neural network and update the parameters to make the model learn from the data. Consider the following neural network with a cross-entropy loss function:

[Network diagram: layer 1 holds the inputs x1 (node 0) and x2 (node 1); layer 2 holds nodes 2 and 3; layer 3 holds nodes 4 and 5; layer 4 holds nodes 6 and 7; layer 5 holds the outputs o1 (node 8) and o2 (node 9). Every node is connected to every node in the next layer.]
A multilayered perceptron with three hidden layers and an output layer of size two.

Note that each of layers 2–5 has a weight matrix, and each node in these layers has an associated bias. The output of each of these layers is computed by applying the sigmoid to the forward propagation equation:

Z = W^\text{T} \times X + b

out = \sigma(Z)

Where,

  • $W$ is the weight matrix of the layer
  • $X$ is the input from the previous layer
  • $b$ is the bias vector of the layer
  • $\sigma$ is the sigmoid activation function
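
As a quick illustration, here is a minimal NumPy sketch of this forward pass for a single layer; the weight, bias, and input values are made up purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation: squashes values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_layer(W, X, b):
    """One layer's output: out = sigmoid(W^T X + b)."""
    Z = W.T @ X + b      # net input of the layer
    return sigmoid(Z)    # activated output passed to the next layer

# Example: layer 2 of the network above (2 inputs -> 2 nodes).
X = np.array([0.5, 0.1])            # outputs of layer 1 (x1, x2)
W = np.array([[0.15, 0.25],         # rows index inputs, columns index nodes
              [0.20, 0.30]])
b = np.array([0.35, 0.35])          # one bias per node
print(forward_layer(W, X, b))
```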

The binary cross-entropy loss is given as follows:

Cost(y_i, p_i) = - \frac{1}{2N} \sum_{1}^{N} \sum_{i=1}^{2} \left( y_i \log(p_i) + (1-y_i)\log(1-p_i) \right)

Where,

  • $y$ is the one-hot encoded actual class
  • $p$ is the one-hot encoded predicted class
  • $N$ is the sample size
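
For illustration, the following NumPy snippet computes this cost for a small batch; the values of `Y` and `P` below are made up for demonstration:

```python
import numpy as np

def binary_cross_entropy(Y, P):
    """Cost = -(1 / 2N) * sum over the batch and the 2 outputs of
    y*log(p) + (1 - y)*log(1 - p), matching the formula above."""
    N = Y.shape[0]                      # sample (batch) size
    eps = 1e-12                         # guard against log(0)
    P = np.clip(P, eps, 1 - eps)
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)) / (2 * N)

Y = np.array([[1, 0],                   # one-hot actual classes
              [0, 1]])
P = np.array([[0.8, 0.3],               # predicted probabilities (outputs o1, o2)
              [0.4, 0.9]])
print(binary_cross_entropy(Y, P))
```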

Parameter updates

We need to perform parameter updates using the following equation:

w_{new} = w_{old} - \alpha \times \Delta w

and,

\Delta w = \frac{\partial Cost}{\partial w}

Where,

  • $w$ is the parameter
  • $\alpha$ is the learning rate
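
The update rule itself is a one-liner; in the sketch below, the weight matrix, gradient, and learning rate are illustrative values only:

```python
import numpy as np

def update(w_old, delta_w, alpha=0.1):
    """Gradient-descent step: w_new = w_old - alpha * delta_w."""
    return w_old - alpha * delta_w

W  = np.array([[0.40, 0.45],            # current weights (illustrative)
               [0.50, 0.55]])
dW = np.array([[0.08, 0.01],            # dCost/dW from backpropagation (illustrative)
               [0.09, 0.02]])
print(update(W, dW, alpha=0.5))
```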

Therefore, we will backpropagate each parameter’s share of the cost/loss and update the parameter by a fraction determined by the learning rate.

Last layer

Let’s obtain the derivatives ($\Delta w$) for layer 5.

For the last layer:

W_5 = \begin{bmatrix} w_{68} & w_{69} \\ w_{78} & w_{79} \end{bmatrix}

\frac{\partial Cost_{total}}{\partial w_{68}} = \frac{\partial Cost_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial w_{68}}

where,

\frac{\partial net_{o8}}{\partial w_{68}} = out_{h6},

and,

\frac{\partial out_{o8}}{\partial net_{o8}} = out_{o8}(1-out_{o8}),

because,

Cost_{total} = - \frac{1}{2N} \sum^{\text{for batch}} \Big( \big( y_1 \log(out_{o8}) + (1-y_1)\log(1-out_{o8}) \big) + \big( y_2 \log(out_{o9}) + (1-y_2)\log(1-out_{o9}) \big) \Big),

we have,

\begin{align*} \frac{\partial Cost_{total}}{\partial out_{o8}} &= - \frac{1}{2N} \sum^{\text{for batch}} \left(\frac{y_1}{out_{o8}} - \frac{1-y_1}{1-out_{o8}}\right) \\ &= \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}}\right) \end{align*}.

Therefore, we get the equation for computing $\Delta w_{68}$:

\frac{\partial Cost_{total}}{\partial w_{68}} = \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}}\right) \times out_{o8}(1-out_{o8}) \times out_{h6}.

Likewise, for $\Delta b_8$, because:

\frac{\partial net_{o8}}{\partial b_{8}} = 1,

we get,

\begin{align*} \frac{\partial Cost_{total}}{\partial b_{8}} &= \frac{\partial Cost_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial b_{8}} \\ &= \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}}\right) \times out_{o8}(1-out_{o8}) \end{align*}.

This common factor is referred to as $\delta$ (delta); since $\Delta b_8$ equals it exactly, we will refer to $\Delta b_8$ as $\delta_8$.

Similarly, we can find $\Delta w_{69}$ using the equation below:

\frac{\partial Cost_{total}}{\partial w_{69}} = \frac{\partial Cost_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial w_{69}}

where,

\begin{align*} \frac{\partial Cost_{total}}{\partial out_{o9}} &= - \frac{1}{2N} \sum^{\text{for batch}} \left(\frac{y_2}{out_{o9}} - \frac{1-y_2}{1-out_{o9}}\right) \\ &= \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}}\right) \end{align*},

and,

\frac{\partial out_{o9}}{\partial net_{o9}} = out_{o9}(1-out_{o9}),

and,

\frac{\partial net_{o9}}{\partial w_{69}} = out_{h6},

so $\Delta w_{69}$ becomes:

\frac{\partial Cost_{total}}{\partial w_{69}} = \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}}\right) \times out_{o9}(1-out_{o9}) \times out_{h6}.

Likewise, for $\Delta b_9$:

\frac{\partial Cost_{total}}{\partial b_{9}} = \frac{\partial Cost_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial b_{9}},

where,

\frac{\partial net_{o9}}{\partial b_{9}} = 1,

so,

\frac{\partial Cost_{total}}{\partial b_{9}} = \frac{1}{2N} \sum^{\text{for batch}} \left(-\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}}\right) \times out_{o9}(1-out_{o9}).

Similarly, we will refer to $\Delta b_9$ as $\delta_9$.
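
To make the last-layer computation concrete, here is a small NumPy sketch that computes $\delta_8$ and $\delta_9$ and the corresponding weight and bias gradients for a single sample (so the batch sum and the $\frac{1}{2N}$ factor are dropped); all numeric values are illustrative:

```python
import numpy as np

out_h = np.array([0.6, 0.7])      # out_h6, out_h7: activations of layer 4
out_o = np.array([0.75, 0.55])    # out_o8, out_o9: activations of layer 5
y     = np.array([1.0, 0.0])      # one-hot actual class (y1, y2)

# dCost/dout and the sigmoid derivative dout/dnet for each output node
dcost_dout = -y / out_o + (1 - y) / (1 - out_o)
dout_dnet  = out_o * (1 - out_o)
delta      = dcost_dout * dout_dnet          # [delta_8, delta_9]

# Last-layer gradients: dCost/dw_ij = delta_j * out_h_i and dCost/db_j = delta_j
dW5 = np.outer(out_h, delta)                 # [[dw68, dw69], [dw78, dw79]]
db5 = delta                                  # [db8, db9]
print(dW5, db5)
```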

Hidden layers

To find the gradient updates for a node’s weights, we need to take into account that node’s contribution to the error. For node 6 of the second-to-last layer, this contribution is:

\frac{\partial Cost_{total}}{\partial out_{h6}} = \frac{\partial Cost_{o8}}{\partial out_{h6}} + \frac{\partial Cost_{o9}}{\partial out_{h6}},

where,

\begin{align*} \frac{\partial Cost_{o8}}{\partial out_{h6}} &= \frac{\partial Cost_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial out_{h6}} \\ &= \delta_8 \times w_{68} \end{align*},

and,

\begin{align*} \frac{\partial Cost_{o9}}{\partial out_{h6}} &= \frac{\partial Cost_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial out_{h6}} \\ &= \delta_9 \times w_{69} \end{align*}.

Therefore,

\frac{\partial Cost_{total}}{\partial out_{h6}} = \delta_8 \times w_{68} + \delta_9 \times w_{69}.
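
In code, this backpropagated contribution is just a matrix-vector product of the layer’s weights with the output deltas. The sketch below uses made-up values:

```python
import numpy as np

# W5 rows index the sending nodes (6, 7) and columns the receiving nodes (8, 9).
W5    = np.array([[0.40, 0.45],   # [[w68, w69],
                  [0.50, 0.55]])  #  [w78, w79]]
delta = np.array([0.05, -0.02])   # [delta_8, delta_9] from the last layer

# Row 0: dCost/dout_h6 = delta_8*w68 + delta_9*w69; row 1 is the same for node 7.
dcost_dout_h = W5 @ delta
print(dcost_dout_h)
```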

To compute $\Delta w_{46}$:

\frac{\partial Cost_{total}}{\partial w_{46}} = \frac{\partial Cost_{total}}{\partial out_{h6}} \times \frac{\partial out_{h6}}{\partial net_{h6}} \times \frac{\partial net_{h6}}{\partial w_{46}}
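
For a single sample, this chain rule reduces to a few multiplications; the numbers below are illustrative placeholders:

```python
dcost_dout_h6 = 0.011                    # delta_8*w68 + delta_9*w69, from the previous step
out_h6        = 0.62                     # activation of node 6 (layer 4)
out_h4        = 0.58                     # activation of node 4 (layer 3), the input on w46

dout_dnet_h6 = out_h6 * (1 - out_h6)     # sigmoid derivative at node 6
dw46 = dcost_dout_h6 * dout_dnet_h6 * out_h4
print(dw46)                              # dCost/dw46
```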

We can then compute the contribution of node 4 in the previous layer in the same way:

\begin{align*} \frac{\partial Cost_{total}}{\partial out_{h4}} &= \frac{\partial Cost_{total}}{\partial out_{h6}} \times \frac{\partial out_{h6}}{\partial net_{h6}} \times \frac{\partial net_{h6}}{\partial out_{h4}} + \frac{\partial Cost_{total}}{\partial out_{h7}} \times \frac{\partial out_{h7}}{\partial net_{h7}} \times \frac{\partial net_{h7}}{\partial out_{h4}} \\ &= (\delta_8 \times w_{68} + \delta_9 \times w_{69}) \times out_{h6}(1- out_{h6}) \times w_{46} \\ &+ (\delta_8 \times w_{78} + \delta_9 \times w_{79}) \times out_{h7}(1- out_{h7}) \times w_{47} \end{align*}.

The gradients for the remaining hidden layers are computed in the same way, working backward one layer at a time, as sketched below.
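
Putting the pieces together, the following sketch runs the whole backward pass for a network like the one above. The function, variable names, and values are assumptions for illustration only; `outs` holds the stored activations of layers 1–5 and `weights` holds $W_2$–$W_5$:

```python
import numpy as np

def backward(outs, weights, y):
    """outs: activation vectors of layers 1-5; weights: [W2, W3, W4, W5];
    y: one-hot target. Returns (dW, db) for each weighted layer, front to back."""
    grads = []
    out_o = outs[-1]
    # Error signal at the output layer (single sample, no 1/2N averaging).
    dcost_dout = -y / out_o + (1 - y) / (1 - out_o)
    for k in range(len(weights) - 1, -1, -1):
        out_k = outs[k + 1]                               # this layer's activations
        delta = dcost_dout * out_k * (1 - out_k)          # elementwise sigmoid derivative
        grads.append((np.outer(outs[k], delta), delta))   # (dCost/dW, dCost/db)
        dcost_dout = weights[k] @ delta                   # contribution passed one layer back
    return list(reversed(grads))

# Tiny usage example with made-up activations and weights (2 nodes per layer).
outs    = [np.array([0.5, 0.1])] + [np.array([0.6, 0.4]) for _ in range(4)]
weights = [np.full((2, 2), 0.3) for _ in range(4)]
y       = np.array([1.0, 0.0])
for dW, db in backward(outs, weights, y):
    print(dW, db)
```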
