From the Chain Rule to Backpropagation
Learn how to apply the chain rule in backpropagation.
Backpropagation is an application of the chain rule, one of the fundamental rules of calculus. The chain rule calculates the gradient of any node y with respect to any other node x: we multiply the local gradients of all the operations on the way back from y to x.
Let’s understand how the chain rule works on a couple of network-like structures: first a simple one, then a more complicated one.
The chain rule on a simple network
Let’s look at this simple network-like structure:
This is not a neural network, because it does not have weights. Let’s borrow a term from computer science and call it a computational graph. This graph has an input x, followed by two operations: multiply by two and square. The output of the multiplication is called m, and the output of the entire graph is called y.
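To make the graph concrete, here is a minimal sketch of it in plain Python (the function name `forward` is our own; the lesson itself only describes the operations):

```python
# The computational graph as plain Python: multiply by two, then square.
def forward(x):
    m = 2 * x   # first operation: output m
    y = m ** 2  # second operation: output y
    return y

print(forward(3))  # 36, because y = (2 * 3) ** 2
```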
Now let’s say that we want to calculate ∂y/∂x, the gradient of y with respect to x. Intuitively, that gradient represents the impact of x on y. Whenever x changes, y also changes, and the gradient measures the amount of that change. (If you find gradients confusing, review the Gradient Descent lesson.)
For such a small graph, we could calculate ∂y/∂x in a single step, by taking the derivative of y with respect to x. However, as we mentioned earlier, that derivation would become impractical for very large graphs. Instead, let’s calculate the gradient using the chain rule, which works for graphs of any size.
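For the record, the single-step derivative is easy to compute here: y = (2x)² = 4x², so ∂y/∂x = 8x. Keep that result in mind, because the chain rule should reproduce it.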
Here is how the chain rule works. To calculate ∂y/∂x:
- Walk the graph back from y to x.
- For each operation along the way, calculate its local gradient—the derivative of the operation’s output with respect to its input.
- Multiply all the local gradients together.
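In equation form, those steps boil down to a product of local gradients. For our graph, which passes through the intermediate node m, they read:

∂y/∂x = ∂y/∂m · ∂m/∂x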
Let’s see how that process works in practice. In our case, the path back from y to x involves two operations:
- A square.
- A multiplication by two.
Let’s compile the local gradients of those two operations:
- The square takes m and outputs y = m², so its local gradient is ∂y/∂m = 2m.
- The multiplication by two takes x and outputs m = 2x, so its local gradient is ∂m/∂x = 2.
How do we know that ∂y/∂m is 2m, and ∂m/∂x is 2? Well, even though we use the chain rule, we must still compute the local gradients in an old-fashioned way, by taking derivatives by hand. However, don’t worry if you do not know how to take derivatives. We can always use libraries to do that. For now, we just have to understand the process.
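For example, here is a minimal sketch that uses PyTorch’s autograd to take the derivatives for us (PyTorch is our choice for illustration; the lesson does not prescribe a specific library):

```python
import torch

# Rebuild the computational graph with a tensor that tracks gradients.
x = torch.tensor(3.0, requires_grad=True)
m = 2 * x      # multiply by two
y = m ** 2     # square
y.backward()   # walks the graph back from y, applying the chain rule
print(x.grad)  # tensor(24.), which matches 8x at x = 3
```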
Now that we have the local gradients, we can multiply them to get ∂y/∂x:
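∂y/∂x = ∂y/∂m · ∂m/∂x = 2m · 2 = 4m

Substituting m = 2x, we get ∂y/∂x = 8x, the same result as differentiating y = 4x² directly.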
...