Understanding LSTM Activations and Stabilized Gradients
Understand how LSTM activations work, including cell state and gate functions, and learn how stabilized gradients in LSTM networks prevent vanishing or exploding gradients, enabling modeling of long-term dependencies in sequential data. This lesson covers parameters, activation functions, and why sigmoid recurrent_activation is critical for effective deep learning in rare event prediction.
Activations in LSTM
The activations for the candidate cell state and for emitting the cell output correspond to the activation argument in an LSTM layer in TensorFlow. By default, it is tanh. These expressions act as learned features and, therefore, can take any value. With tanh activation, they are in (-1, 1). Other suitable activations can also be used for them.
On the other hand, the activations for the input, output, and forget gates correspond to the recurrent_activation argument in TensorFlow. These gates act as scales. Therefore, they are intended to stay in (0, 1). Hence, their default is sigmoid. For most purposes, it is essential to keep recurrent_activation as sigmoid.
Note: The recurrent_activation should be sigmoid. The default activation is tanh but can be set to other activations such as relu.
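The difference between the two roles can be illustrated with a minimal sketch in plain Python. It does not use TensorFlow, and the pre-activation values are hypothetical, picked only to show the ranges: tanh keeps feature-like quantities in (-1, 1), while sigmoid keeps gates in (0, 1) so that each gate acts as a fraction applied to a feature.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; its output lies strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-activation values, chosen only for illustration.
pre_activations = [-4.0, -1.0, 0.0, 1.0, 4.0]

# Feature-like quantities (candidate state, emitted output) use tanh.
features = [math.tanh(z) for z in pre_activations]

# Gates use sigmoid, so each one is a fraction between 0 and 1.
gates = [sigmoid(z) for z in pre_activations]

# A gate scales a feature, so the product is never larger in magnitude
# than the feature itself.
scaled = [g * f for g, f in zip(gates, features)]

for z, f, g, s in zip(pre_activations, features, gates, scaled):
    print(f"z={z:+.1f}  tanh={f:+.4f}  sigmoid={g:.4f}  gate*feature={s:+.4f}")
```

If the gate activation were replaced with an unbounded function such as relu, the gate could exceed 1 and amplify the signal rather than scale it down, which is one way to read the advice to keep recurrent_activation as sigmoid.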
Parameters
Suppose an LSTM layer has m cells, that is, the layer size is equal to m. The cell mechanism is for one cell in an LSTM layer. The parameters involved in the cell are the recurrent weights w^(h), the input weights w^(x), and the bias b, with one set for each of c, i, f, and o (the cell state and the input, forget, and output gates).
A cell takes in the prior output of all the sibling cells in the layer. Given the layer size is m, the prior output from the layer cells, h_{t-1}, is an m-vector and, therefore, the recurrent weights w^(h) also have length m.
The weight for the input time-step x_t is a p-vector given there are p features, that is, x_t is in R^p. Lastly, the bias on a cell is a scalar.
Combining them for each of c, i, f, and o, the total number of parameters in a cell is 4(m + p + 1).
In the LSTM layer, there are m cells. Therefore, the total number of parameters in a layer is:
...
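The counting argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not a TensorFlow call. It assumes the standard LSTM formulation with four weight sets per cell (cell state plus the input, forget, and output gates). The names m (layer size), p (number of input features), and both function names are introduced here for illustration.

```python
def lstm_cell_params(m: int, p: int) -> int:
    """Parameters in one cell.

    Each of the four weight sets (c, i, f, o) has m recurrent weights,
    p input weights, and 1 bias.
    """
    return 4 * (m + p + 1)


def lstm_layer_params(m: int, p: int) -> int:
    """Parameters in a layer of m cells."""
    return m * lstm_cell_params(m, p)


# Hypothetical sizes, chosen only for illustration.
m, p = 16, 8
print(lstm_cell_params(m, p))   # 4 * (16 + 8 + 1) = 100
print(lstm_layer_params(m, p))  # 16 * 100 = 1600
```

Under the same assumption, the result can be compared with the parameter count that a framework reports for an LSTM layer of the same size and input width.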