What is the dying ReLU problem?

Rectified Linear Unit (ReLU) is a nonlinear activation function used in deep learning. It is defined as f(x) = max(0, x).

The function maps negative values to zero and returns positive values unchanged. Because neurons whose pre-activations are negative output zero, not all neurons are active at the same time, which makes ReLU cheaper to compute than other activation functions. ReLU also helps with the vanishing gradient problem because its derivative is either 0 or 1, so gradients are not repeatedly shrunk as they flow backward.
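As a concrete illustration, here is a minimal NumPy sketch of ReLU and its derivative (the function names are our own, not from any particular library):

```python
import numpy as np

def relu(x):
    # f(x) = max(0, x): negative inputs map to 0, positive inputs pass through
    return np.maximum(0.0, x)

def relu_grad(x):
    # The derivative is 0 for x < 0 and 1 for x > 0
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```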

A neural network with ReLU as an activation function

What is the dying ReLU problem?

A dying ReLU outputs the same value, 0, for every input. This condition is known as the dead state of a ReLU neuron. It is difficult to recover from this state because the gradient is 0 wherever the output is 0. The problem arises when most of the values reaching the neuron are negative, so the derivative of the ReLU is 0 almost everywhere.

Because the outputs are 0, no gradient flows through the neuron during backpropagation, and its weights are never updated. In the worst case, the whole network collapses into a constant function; a network that is already dead before training is said to be born dead. As long as the inputs keep pushing the ReLU into its positive segment, the dying ReLU problem does not occur.
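The dead state can be made concrete with a small NumPy sketch of a single neuron; the weights and the large negative bias below are made-up values chosen to force the problem:

```python
import numpy as np

# A single "dead" ReLU neuron: illustrative weights and a large negative bias
w = np.array([0.5, -0.3])
b = -10.0                                        # bias dominates, keeping w·x + b negative

inputs = np.random.randn(1000, 2)                # typical, standardized inputs
pre_activation = inputs @ w + b                  # negative for essentially every input
output = np.maximum(0.0, pre_activation)         # always 0
grad_mask = (pre_activation > 0).astype(float)   # always 0: no gradient reaches the weights

print(output.max())     # 0.0 -> the neuron outputs 0 for every input
print(grad_mask.sum())  # 0.0 -> the weights are never updated
```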

Causes of the dying ReLU

There are two major causes of the dying ReLU problem:

  • Setting high learning rates
  • Having a large negative bias

Let's discuss them in detail.

High learning rates

In neural networks, the weights are updated with the gradient descent rule:

w_new = w_old − α × ∂L/∂w

Here, α is the learning rate and ∂L/∂w is the gradient of the loss with respect to the weight.

If the learning rate α is set too high, the update term α × ∂L/∂w can be larger than the weight itself, and subtracting it flips the weight to a negative value. Weights that are pushed strongly negative tend to make the pre-activations fed into the ReLU negative as well, which causes the dying ReLU problem.
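A toy illustration of this effect, using made-up weights and gradients rather than values from a real network:

```python
import numpy as np

w = np.array([0.4, 0.6])      # current (positive) weights
grad = np.array([2.0, 3.0])   # gradient of the loss w.r.t. the weights (illustrative)

for lr in (0.01, 1.0):
    w_new = w - lr * grad     # w_new = w_old - alpha * dL/dw
    print(lr, w_new)

# lr = 0.01 -> [0.38 0.57]  small step: weights stay positive
# lr = 1.0  -> [-1.6 -2.4]  large step overshoots and drives the weights negative
```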

Large negative bias

In neural networks, a bias term is also added to the weighted sum before the activation function is applied. A large negative bias pushes the ReLU pre-activation below zero, so the activation outputs 0 for every input, resulting in the dying ReLU problem.
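The same effect can be shown at the layer level with PyTorch; the bias value of -10 is forced by hand purely for illustration:

```python
import torch

torch.manual_seed(0)
layer = torch.nn.Linear(4, 3)
relu = torch.nn.ReLU()

with torch.no_grad():
    layer.bias.fill_(-10.0)   # force a large negative bias on every neuron

x = torch.randn(8, 4)         # a batch of typical inputs
out = relu(layer(x))          # the bias pushes all pre-activations below zero

print(out.abs().sum())        # tensor(0.) -> every neuron in the layer outputs 0
```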

Recovering from dying ReLU

Different techniques are used to solve the dying ReLU problem.

Lowering the learning rate and using a positive bias

Lowering the learning rate and initializing the bias to a small positive value reduce the chance of dying ReLU. Both push the ReLU pre-activations toward the positive side, so the neurons stay active and gradients keep flowing.
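A minimal PyTorch sketch of this setup, assuming a small two-layer network; the 0.1 bias and the 1e-3 learning rate are illustrative starting points, not prescribed values:

```python
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 1),
)

# Initialize biases to a small positive value so pre-activations start slightly above zero
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        torch.nn.init.constant_(module.bias, 0.1)

# A conservative learning rate avoids the large updates that flip weights negative
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
```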

Using Leaky ReLU

Another popular technique is Leaky ReLU. Like ReLU, it helps against the vanishing gradient problem and converges quickly, but unlike ReLU its gradient is non-zero over the entire domain: the slope on the negative side is a small non-zero constant. A negative input therefore produces a small negative output instead of zero, which allows a neuron to keep receiving gradient and recover from the dying ReLU state.
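A short PyTorch sketch showing that Leaky ReLU keeps a non-zero gradient for negative inputs (the slope of 0.01 is PyTorch's default, written out explicitly here):

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.5, 2.0], requires_grad=True)

leaky = torch.nn.LeakyReLU(negative_slope=0.01)
y = leaky(x)
y.sum().backward()

print(y)       # roughly [-0.02, -0.005, 0.5, 2.0]: small negative outputs, not zeros
print(x.grad)  # [0.01, 0.01, 1.0, 1.0] -> the gradient is non-zero everywhere
```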

Other techniques include the Parametric ReLU (PReLU) and the Exponential Linear Unit (ELU).
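Both are available in PyTorch; the snippet below is only a quick sketch of how they treat a negative input:

```python
import torch

prelu = torch.nn.PReLU()         # learnable negative slope, initialized to 0.25
elu = torch.nn.ELU(alpha=1.0)    # smooth exponential curve for negative inputs

x = torch.tensor([-2.0, 0.5])
print(prelu(x))  # roughly [-0.5, 0.5]: negative side scaled by the learned slope
print(elu(x))    # roughly [-0.86, 0.5]: alpha * (exp(-2) - 1) on the negative side
```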

Leaky ReLU vs ReLU

Conclusion

The ReLU activation function is widely used across neural networks, so it is important to recognize the failure modes that come with it. Since ReLU is the most common choice of activation, steps such as lowering the learning rate, avoiding large negative biases, or switching to Leaky ReLU should be taken to avoid the dying ReLU problem.