Rectified Linear Unit (ReLU) is a nonlinear activation function used in deep learning.
ReLU is defined as f(x) = max(0, x), which means that negative values are mapped to zero and positive values are returned unchanged. It is cheap to compute, and because negative inputs produce an output of zero, not all neurons are activated at the same time; this sparsity is what makes ReLU faster than other common activation functions. ReLU also mitigates the vanishing gradient problem because its derivative is either 0 or 1, so the gradients of active neurons are passed backward without shrinking.
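As a quick illustration, here is a minimal NumPy sketch of ReLU and its derivative (the function names and sample values are just for illustration):

```python
import numpy as np

def relu(x):
    # Negative values are mapped to 0; positive values are returned unchanged.
    return np.maximum(0.0, x)

def relu_grad(x):
    # The derivative is 1 for positive inputs and 0 elsewhere.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
print(relu(x))       # [0.  0.  0.  1.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```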
A dying ReLU outputs the same value, 0, for every input. This condition is known as the dead state of a ReLU neuron, and it is hard to recover from because the gradient at an output of 0 is also 0. It becomes a problem when most of the pre-activation inputs are negative, i.e., they fall in the region where the derivative of ReLU is 0.
Because the output is 0, no gradient flows back during backpropagation, and the weights of the dead neuron are never updated. In the worst case, the network collapses to a constant function and the entire neural network dies; a network that is already dead before training is said to be born dead. As long as the inputs keep the ReLU pre-activations in the positive region, the dying ReLU problem doesn't occur.
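The following NumPy sketch (with made-up weights and inputs) shows a dead neuron: every pre-activation is negative, so the output is always 0 and the gradient flowing back to the weights is 0, which means the weights can never be updated.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # non-negative inputs, e.g. outputs of a previous ReLU layer
w = np.array([-2.0, -1.5, -3.0])           # weights that have been pushed negative
b = -1.0

z = X @ w + b                              # pre-activations: all negative
a = np.maximum(0.0, z)                     # ReLU outputs: all zero -> a dead neuron

upstream = np.ones_like(a)                 # pretend the upstream gradient is 1
grad_z = upstream * (z > 0)                # ReLU gradient: 0 wherever z <= 0
grad_w = X.T @ grad_z                      # gradient w.r.t. the weights
grad_b = grad_z.sum()                      # gradient w.r.t. the bias

print(a.max())                             # 0.0 -> the neuron only ever outputs 0
print(grad_w, grad_b)                      # all zeros -> no weight update is possible
```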
There are two major causes of the dying ReLU problem: a high learning rate and a large negative bias. Let's discuss them in detail.
In neural networks, the weights are updated using the following equation:

W_new = W_old − α · (∂L/∂W)

Here, α is the learning rate and ∂L/∂W is the gradient of the loss with respect to the weights.
If the learning rate α is set too high, the update term α · (∂L/∂W) can become larger than the current weights, and subtracting the larger value from the smaller one pushes the new weights into the negative range. These negative weights make the pre-activations fed to the ReLU negative for typical inputs, which causes the dying ReLU problem.
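A toy NumPy example of the update rule above (the numbers are made up): with a very large learning rate, a single step drives the weights and bias strongly negative, after which the pre-activation for a typical input is negative and the ReLU output drops to 0.

```python
import numpy as np

w, b = np.array([0.5, 0.3]), 0.1             # current parameters
grad_w, grad_b = np.array([2.0, 1.5]), 1.0   # gradients from backpropagation

for lr in (0.01, 10.0):                      # a modest vs. a very large learning rate
    new_w = w - lr * grad_w                  # W_new = W_old - alpha * dL/dW
    new_b = b - lr * grad_b
    x = np.array([1.0, 1.0])                 # a typical non-negative input
    z = new_w @ x + new_b                    # pre-activation fed to ReLU
    print(lr, new_w, new_b, max(0.0, z))     # large lr -> negative weights, ReLU output 0
```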
In neural networks, a bias term is also added before the activation function is applied. A large negative bias makes the ReLU pre-activations negative, so the output is 0, resulting in a dying ReLU problem.
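A minimal sketch of the same effect with made-up numbers: the large negative bias keeps the pre-activation negative over the whole input range, so the ReLU output is always 0.

```python
import numpy as np

w, b = 0.8, -6.0                 # a large negative bias
x = np.linspace(-2.0, 2.0, 9)    # a typical range of inputs
z = w * x + b                    # pre-activations: all negative
print(np.maximum(0.0, z))        # all zeros -> the neuron is dead
```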
Different techniques are used to solve the dying ReLU problem.
Lowering the learning rate and initializing the bias with a small positive value reduce the chance of dying ReLU. Both push the ReLU pre-activations toward the positive side, which keeps the neurons active so that gradients continue to flow.
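A minimal sketch of what this looks like in practice, assuming PyTorch (the bias value 0.1 and the learning rate are illustrative choices, not prescriptions):

```python
import torch
import torch.nn as nn

layer = nn.Linear(64, 32)
nn.init.constant_(layer.bias, 0.1)   # small positive bias keeps pre-activations positive at the start
optimizer = torch.optim.SGD(layer.parameters(), lr=1e-3)  # modest learning rate avoids huge negative updates
```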
Another popular technique is Leaky ReLU, which keeps gradients from vanishing on the negative side and converges quickly. Leaky ReLU has a non-zero gradient over its entire domain: the slope on the negative side is a small non-zero constant, unlike standard ReLU. As a result, a negative input produces a small negative output instead of 0, so gradients keep flowing and the neuron can recover from the dying ReLU state.
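A minimal NumPy sketch of Leaky ReLU (the slope 0.01 is a commonly used default, but it is a tunable hyperparameter):

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    # Positive inputs pass through; negative inputs are scaled by a small slope
    # instead of being zeroed out, so the gradient never becomes exactly 0.
    return np.where(x > 0, x, slope * x)

x = np.array([-3.0, -1.0, 0.0, 2.0])
print(leaky_relu(x))   # [-0.03 -0.01  0.    2.  ]
```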
Some other techniques include the Parametric ReLU (PReLU) and the Exponential Linear Unit (ELU).
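For completeness, sketches of these two alternatives (in PReLU the negative-side slope is a learnable parameter, while in ELU alpha is a fixed hyperparameter, commonly 1.0):

```python
import numpy as np

def prelu(x, a):
    # Like Leaky ReLU, but the negative-side slope `a` is learned during training.
    return np.where(x > 0, x, a * x)

def elu(x, alpha=1.0):
    # Smoothly saturates toward -alpha for very negative inputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))
```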
The ReLU activation function is widely used across neural network architectures, so it is important to understand the issues that can arise with it. Because ReLU is the most commonly used activation function, steps should be taken to avoid the dying ReLU problem.