Note that each of layers 2–5 has a weight matrix, and each node in these layers has an associated bias. The output of these layers is obtained by applying the sigmoid to the forward-propagation equation:
$$Z = W^T X + b$$
$$out = \sigma(Z)$$
Where,
- W is the weight matrix of the layer
- X is the input from the previous layer
- b is the bias vector of the layer
- σ is the sigmoid activation function
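As a concrete sketch of the forward pass (assuming NumPy; the layer sizes and batch size here are hypothetical, chosen only to show the shapes implied by $Z = W^T X + b$):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(W, X, b):
    """One layer's forward pass: Z = W^T X + b, out = sigmoid(Z)."""
    Z = W.T @ X + b
    return sigmoid(Z)

# Hypothetical shapes: 3 inputs feeding 2 nodes, over a batch of 4 samples.
rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))  # weight matrix (inputs x nodes)
X = rng.standard_normal((3, 4))  # input from the previous layer (inputs x batch)
b = np.zeros((2, 1))             # bias vector, broadcast across the batch
out = layer_forward(W, X, b)
print(out.shape)                 # (2, 4); every entry lies in (0, 1)
```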
The binary cross-entropy loss is given as follows:
$$\mathrm{Cost}(y_i, p_i) = -\frac{1}{2N} \sum_{i=1}^{N} \sum_{c=1}^{2} \left[ y_{ic}\log(p_{ic}) + (1 - y_{ic})\log(1 - p_{ic}) \right]$$
Where,
- y is the one-hot encoded actual class
- p is the predicted probability for each class
- N is the sample size
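A minimal NumPy sketch of this cost, assuming the targets `Y` are one-hot with shape `(2, N)` and `P` holds the predicted probabilities in the same shape (the `eps` clamp is a numerical-safety addition, not part of the formula):

```python
import numpy as np

def bce_cost(Y, P, eps=1e-12):
    """Binary cross-entropy averaged over N samples and the 2 output classes."""
    N = Y.shape[1]
    P = np.clip(P, eps, 1.0 - eps)  # avoid log(0); numerical safety only
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)) / (2 * N)

# Two samples: the first belongs to class 1, the second to class 2.
Y = np.array([[1.0, 0.0],
              [0.0, 1.0]])
P = np.array([[0.9, 0.2],
              [0.1, 0.8]])
print(bce_cost(Y, P))  # small positive cost; it shrinks as P approaches Y
```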
Parameter updates
We need to perform parameter updates using the following equation:
$$w_{new} = w_{old} - \alpha \times \Delta w$$
and,
$$\Delta w = \frac{\partial \mathrm{Cost}}{\partial w}$$
Where,
- w is the parameter
- α is the learning rate
Therefore, we will backpropagate each parameter's contribution to the cost/loss and update the parameter by a fraction of that gradient, determined by the learning rate.
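The update rule can be sketched on a toy one-parameter problem (the cost $(w-3)^2$ is hypothetical, chosen only to illustrate the rule):

```python
# Minimize a toy cost, Cost(w) = (w - 3)^2, whose gradient is dCost/dw = 2(w - 3).
w = 0.0      # w_old
alpha = 0.1  # learning rate
for _ in range(100):
    grad = 2.0 * (w - 3.0)  # delta_w = dCost/dw
    w = w - alpha * grad    # w_new = w_old - alpha * delta_w
print(round(w, 4))          # converges to 3.0, the cost minimum
```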
Last layer
Let’s obtain the derivatives ($\Delta w$) for layer 5.
For the last layer:
$$W_5 = \begin{bmatrix} w_{68} & w_{69} \\ w_{78} & w_{79} \end{bmatrix}$$
$$\frac{\partial \mathrm{Cost}_{total}}{\partial w_{68}} = \frac{\partial \mathrm{Cost}_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial w_{68}}$$
where,
$$\frac{\partial net_{o8}}{\partial w_{68}} = out_{h6},$$
and,
$$\frac{\partial out_{o8}}{\partial net_{o8}} = out_{o8}(1 - out_{o8}),$$
because,
$$\mathrm{Cost}_{total} = -\frac{1}{2N} \sum_{\text{batch}} \Big[ \big(y_1\log(out_{o8}) + (1-y_1)\log(1-out_{o8})\big) + \big(y_2\log(out_{o9}) + (1-y_2)\log(1-out_{o9})\big) \Big],$$
we have,
$$\frac{\partial \mathrm{Cost}_{total}}{\partial out_{o8}} = -\frac{1}{2N} \sum_{\text{batch}} \left( \frac{y_1}{out_{o8}} - \frac{1-y_1}{1-out_{o8}} \right) = \frac{1}{2N} \sum_{\text{batch}} \left( -\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}} \right).$$
Therefore, we get the equation for computing $\Delta w_{68}$:
$$\frac{\partial \mathrm{Cost}_{total}}{\partial w_{68}} = \frac{1}{2N} \sum_{\text{batch}} \left( -\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}} \right) \times out_{o8}(1 - out_{o8}) \times out_{h6}.$$
Likewise for $\Delta b_8$, because:
$$\frac{\partial net_{o8}}{\partial b_8} = 1,$$
we get,
$$\frac{\partial \mathrm{Cost}_{total}}{\partial b_8} = \frac{\partial \mathrm{Cost}_{total}}{\partial out_{o8}} \times \frac{\partial out_{o8}}{\partial net_{o8}} \times \frac{\partial net_{o8}}{\partial b_8} = \frac{1}{2N} \sum_{\text{batch}} \left( -\frac{y_1}{out_{o8}} + \frac{1-y_1}{1-out_{o8}} \right) \times out_{o8}(1 - out_{o8}).$$
This product is referred to as $\delta$ (delta), so we will refer to $\Delta b_8$ as $\delta_8$.
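The per-sample pieces above can be sketched in NumPy (the batch arrays `y1`, `out_o8`, and `out_h6`, each of shape `(N,)`, are hypothetical). Note that for a sigmoid output paired with this cost, the product conveniently simplifies to $(out_{o8} - y_1)/2N$ per sample:

```python
import numpy as np

# Hypothetical mini-batch of N = 4 samples.
rng = np.random.default_rng(1)
N = 4
y1 = rng.integers(0, 2, size=N).astype(float)  # targets for output node o8
out_o8 = rng.uniform(0.05, 0.95, size=N)       # sigmoid outputs of node o8
out_h6 = rng.uniform(0.05, 0.95, size=N)       # activations of hidden node h6

# dCost_total/dout_o8, per sample, before summing over the batch
dcost_dout = (-y1 / out_o8 + (1.0 - y1) / (1.0 - out_o8)) / (2.0 * N)
delta8 = dcost_dout * out_o8 * (1.0 - out_o8)  # the delta_8 term, per sample

db8 = delta8.sum()              # dCost_total/db_8
dw68 = (delta8 * out_h6).sum()  # dCost_total/dw_68

# Sanity check: delta_8 collapses to (out_o8 - y1) / (2N) per sample.
assert np.allclose(delta8, (out_o8 - y1) / (2.0 * N))
```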
Similarly, we can find $\Delta w_{69}$ by the equation below:
$$\frac{\partial \mathrm{Cost}_{total}}{\partial w_{69}} = \frac{\partial \mathrm{Cost}_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial w_{69}}$$
where,
$$\frac{\partial \mathrm{Cost}_{total}}{\partial out_{o9}} = -\frac{1}{2N} \sum_{\text{batch}} \left( \frac{y_2}{out_{o9}} - \frac{1-y_2}{1-out_{o9}} \right) = \frac{1}{2N} \sum_{\text{batch}} \left( -\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}} \right),$$
and,
$$\frac{\partial out_{o9}}{\partial net_{o9}} = out_{o9}(1 - out_{o9}),$$
and,
$$\frac{\partial net_{o9}}{\partial w_{69}} = out_{h6},$$
so $\Delta w_{69}$ becomes:
$$\frac{\partial \mathrm{Cost}_{total}}{\partial w_{69}} = \frac{1}{2N} \sum_{\text{batch}} \left( -\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}} \right) \times out_{o9}(1 - out_{o9}) \times out_{h6}.$$
Likewise for $\Delta b_9$:
$$\frac{\partial \mathrm{Cost}_{total}}{\partial b_9} = \frac{\partial \mathrm{Cost}_{total}}{\partial out_{o9}} \times \frac{\partial out_{o9}}{\partial net_{o9}} \times \frac{\partial net_{o9}}{\partial b_9},$$
where,
$$\frac{\partial net_{o9}}{\partial b_9} = 1,$$
so,
$$\frac{\partial \mathrm{Cost}_{total}}{\partial b_9} = \frac{1}{2N} \sum_{\text{batch}} \left( -\frac{y_2}{out_{o9}} + \frac{1-y_2}{1-out_{o9}} \right) \times out_{o9}(1 - out_{o9}).$$
We will refer to this product, $\Delta b_9$, as $\delta_9$.
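Putting the last-layer deltas together, a vectorized sketch (assuming NumPy; `out_h` stacks $out_{h6}, out_{h7}$ and `Y` stacks $y_1, y_2$ row-wise, one column per sample — all hypothetical data), with a finite-difference check on one weight:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical batch of N = 5 samples.
rng = np.random.default_rng(2)
N = 5
out_h = rng.uniform(0.05, 0.95, size=(2, N))  # rows: out_h6, out_h7
Y = np.eye(2)[:, rng.integers(0, 2, size=N)]  # one-hot targets (rows: y1, y2)
W5 = rng.standard_normal((2, 2))              # rows h6, h7; columns o8, o9
b5 = rng.standard_normal((2, 1))              # biases b8, b9

def cost(W, b):
    P = sigmoid(W.T @ out_h + b)              # rows: out_o8, out_o9
    return -np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P)) / (2 * N)

# Analytic gradients via the deltas: delta = (out - y) / (2N), per sample.
P = sigmoid(W5.T @ out_h + b5)
delta = (P - Y) / (2 * N)                     # rows: delta_8, delta_9
dW5 = out_h @ delta.T                         # dCost/dw_jk = sum_batch out_hj * delta_k
db5 = delta.sum(axis=1, keepdims=True)        # dCost/db_8, dCost/db_9

# Finite-difference check on w_68 (entry [0, 0] of W5).
eps = 1e-6
Wp, Wm = W5.copy(), W5.copy()
Wp[0, 0] += eps
Wm[0, 0] -= eps
numeric = (cost(Wp, b5) - cost(Wm, b5)) / (2 * eps)
print(bool(np.isclose(numeric, dW5[0, 0])))   # True
```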