Understand Activation Functions

Learn the role of activation functions while building a deep neural network model.

About this chapter

There’s nothing special about deep networks: they are like shallow neural networks, only with more layers. However, when people started experimenting with them, they realized that building deep networks may be easy, but training them is not.

Backpropagation on deep networks comes with its own specific challenges such as vanishing gradients and dead neurons. Those challenges rarely come up in shallow neural networks.

Over the years, neural network researchers developed a collection of strategies to tackle those challenges and tame deep neural networks:

  • New activation functions to replace the sigmoid
  • Multiple flavors of gradient descent
  • More effective weight initializations
  • Better regularization techniques to counter overfitting
  • Other ideas that work, though they do not quite fit any of these categories

This chapter is a whirlwind tour through these techniques. We’ll spend most of our time discussing activation functions. In particular, we’ll discuss why the sigmoid does not pass muster in deep neural networks, and how to replace it. Then we’ll conclude the chapter with a few choices and approaches from the other categories listed above.

The purpose of activation functions

By now, we are familiar with activation functions, those cyan boxes between a neural network’s layers, as shown in the figure below:

So far, all our activation functions have been sigmoids, except in the output layer, where we used the softmax function.

The sigmoid has been with us for a long time. It was originally introduced to squash the output of a perceptron so that it ranged from 0 to 1. Later on, we met the softmax, which rescales a neural network’s outputs so that they add up to 1 and can be interpreted as probabilities. Now that we are building deep neural networks, however, those original motivations feel far away. Activation functions complicate our neural networks and do not seem to give us much in exchange.
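As a refresher, here is a minimal NumPy sketch of those two functions (the function names and test values are just for illustration, not code from this book): the sigmoid squashes each input into the range 0 to 1, and the softmax rescales a vector so that its entries add up to 1.

```python
import numpy as np

def sigmoid(z):
    # Squashes each value into the range (0, 1).
    return 1 / (1 + np.exp(-z))

def softmax(logits):
    # Rescales a vector so its entries are positive and sum to 1,
    # which lets us read them as probabilities.
    exponentials = np.exp(logits)
    return exponentials / np.sum(exponentials)

print(sigmoid(np.array([-2.0, 0.0, 3.0])))   # values between 0 and 1
print(softmax(np.array([1.0, 2.0, 3.0])))    # values that add up to 1
```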

Let’s see what happens if we remove the activation functions from a neural network. The resulting network, stripped of its activations, looks like this:

This network sure looks simpler than the earlier one. However, it comes with a limitation: all its operations are linear, meaning that their plots are straight lines, or flat planes in higher dimensions. To explore the consequences of this linearity, let’s work through a tiny bit of math.

In a network without activation functions, each layer is just a weighted sum of the nodes in the previous layer:

\large {layer_2 = layer_1 \cdot weight_1}

...
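To make the consequence concrete, here is a quick NumPy check (a sketch with made-up shapes and random weights, not code from this book) showing that two stacked linear layers collapse into a single equivalent linear layer, so the extra layer adds no expressive power.

```python
import numpy as np

np.random.seed(0)
layer1 = np.random.rand(1, 4)     # the first layer: one row of 4 nodes
weight1 = np.random.rand(4, 3)
weight2 = np.random.rand(3, 2)

# Without activation functions, each layer is just a matrix multiplication:
layer2 = layer1.dot(weight1)
layer3 = layer2.dot(weight2)

# The same result comes from one multiplication by a combined weight matrix,
# so the "deep" network is equivalent to a single linear layer:
layer3_again = layer1.dot(weight1.dot(weight2))

print(np.allclose(layer3, layer3_again))   # True
```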