Understand Activation Functions

Learn the role of activation functions while building a deep neural network model.

About this chapter

There’s nothing special about deep networks: they are like shallow neural networks, only with more layers. However, when people started experimenting with them, they realized that building deep networks may be easy, but training them is not.

Backpropagation on deep networks comes with its own specific challenges such as vanishing gradients and dead neurons. Those challenges rarely come up in shallow neural networks.

Over the years, neural network researchers developed a collection of strategies to tackle those challenges and tame deep neural networks:

  • New activation functions to replace the sigmoid
  • Multiple flavors of gradient descent
  • More effective weight initializations
  • Better regularization techniques to counter overfitting
  • Other ideas that work, though they do not quite fit any of these categories

This chapter is a whirlwind tour through these techniques. We’ll spend most of our time discussing activation functions. In particular, we’ll discuss why the sigmoid does not pass muster in deep neural networks, and how to replace it. Then we’ll conclude the chapter with a few choices and approaches from the other categories listed above.

The purpose of activation functions

By now, we are familiar with activation functions, those cyan boxes between a neural network’s layers, as shown in the figure below:

So far, all our activation functions have been sigmoids, except in the output layer, where we used the softmax function.

The sigmoid has been with us for a long time. It was originally introduced to squash the output of a perceptron so that it ranged from 0 to 1. Later on, we met the softmax, which rescales a neural network’s outputs so that they add up to 1 and can be interpreted as probabilities. Now that we are building deep neural networks, however, those original motivations feel far away. Activation functions complicate our neural networks and do not seem to give us much in exchange.
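As a refresher, here is a minimal NumPy sketch of those two functions (the function names and test values are just for illustration, not code from this book): the sigmoid squashes each input into the range 0 to 1, and the softmax rescales a vector so that its entries add up to 1.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    # Rescales a vector so its entries are positive and sum to 1,
    # which lets us read them as probabilities.
    exponentials = np.exp(logits)
    return exponentials / np.sum(exponentials)

print(sigmoid(np.array([-2.0, 0.0, 3.0])))   # values between 0 and 1
print(softmax(np.array([1.0, 2.0, 3.0])))    # values that add up to 1
```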

Let’s see what happens if we remove the activation functions from a neural network. The resulting network, stripped of its activations, looks like this:

This network sure looks simpler than the earlier one. However, it comes with a limitation: all its operations are linear, meaning that their plots are straight lines, or flat planes in higher dimensions. To explore the consequences of this linearity, let’s work through a tiny bit of math.

In a network without activation functions, each layer is just a weighted sum of the nodes in the previous layer:

\large {layer_2 = layer_1 \cdot weight_1}

...
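To make the consequence concrete, here is a quick NumPy check (a sketch with made-up shapes and random weights, not code from this book) showing that two stacked linear layers collapse into a single equivalent linear layer, so the extra layer adds no expressive power.

```python
import numpy as np

np.random.seed(0)
layer1 = np.random.rand(1, 4)     # the first layer: one row of 4 nodes
weight1 = np.random.rand(4, 3)
weight2 = np.random.rand(3, 2)

# Without activation functions, each layer is just a matrix multiplication:
layer2 = layer1.dot(weight1)
layer3 = layer2.dot(weight2)

# The same result comes from one multiplication by a combined weight matrix,
# so the "deep" network is equivalent to a single linear layer:
layer3_again = layer1.dot(weight1.dot(weight2))

print(np.allclose(layer3, layer3_again))   # True
```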