Beyond the Sigmoid
Explore the ReLU activation function and discover when to use which activation function.
There is no such thing as a perfect replacement for the sigmoid. Different activation functions work well in different circumstances, and researchers keep coming up with new ones. That being said, one activation function has proven so broadly useful that it’s become a default of sorts. Let’s discuss it in the next section.
Enter the ReLU
The go-to replacement for the sigmoid these days is the rectified linear unit or ReLU. Compared with the sigmoid, the ReLU is surprisingly simple. Here’s a Python implementation of it:
def relu(z):
    if z <= 0:
        return 0
    else:
        return z
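As a quick sanity check, here’s how that function behaves on a handful of sample inputs (the specific values below are just an illustration, not part of the original lesson):

for z in [-3, -0.5, 0, 0.5, 3]:
    print(z, "->", relu(z))

# Output:
# -3 -> 0
# -0.5 -> 0
# 0 -> 0
# 0.5 -> 0.5
# 3 -> 3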
And the following diagram illustrates what it looks like:
The ReLU is composed of two straight segments. Taken together, however, they add up to a nonlinear function, which is exactly what a good activation function should be.
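One quick way to convince yourself of that: a linear function f always satisfies f(a + b) = f(a) + f(b), and the ReLU doesn’t. The small check below reuses the relu function defined above and is my own illustration, not part of the original lesson:

a, b = -1.0, 1.0
print(relu(a) + relu(b))   # 1.0
print(relu(a + b))         # 0.0 -- a different result, so the ReLU is not linear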
The ReLU may be simple, but it’s all the better for it. Computing its gradient is easy, which results in fast training. However, the ReLU’s most useful feature is its gradient of 1 for positive inputs. When backpropagation passes through a ReLU with a positive input, the global gradient is multiplied by 1, so it does not change at all. That detail alone solves the problem of vanishing gradients for ...
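To make that concrete, here’s a rough sketch comparing how a gradient scales as it passes through a chain of ReLUs versus a chain of sigmoids. The helper functions and the choice of ten layers evaluated at z = 2 are assumptions for illustration, not from the original lesson:

import numpy as np

def relu_gradient(z):
    # The ReLU's derivative: 1 for positive inputs, 0 otherwise.
    return 1.0 if z > 0 else 0.0

def sigmoid_gradient(z):
    # The sigmoid's derivative, which never exceeds 0.25.
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# Multiply the local gradients of ten stacked activations, all evaluated at z = 2:
relu_chain = np.prod([relu_gradient(2.0) for _ in range(10)])
sigmoid_chain = np.prod([sigmoid_gradient(2.0) for _ in range(10)])

print(relu_chain)     # 1.0 -- the gradient passes through unchanged
print(sigmoid_chain)  # roughly 1e-10 -- the gradient has all but vanished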