Understanding LSTM Activations and Stabilized Gradients

Explore the key role of activations in LSTMs and how they influence the network's ability to process and remember information over time.

Activations in LSTM

The activations below:

\tilde{c}_t = \text{tanh}(w_c^{(x)}x_t + w_c^{(h)}h_{t-1} + b_c)

for \tilde{c}_t and

h_t = o_t\text{tanh}(c_t)

for emitting h_t correspond to the activation argument of an LSTM layer in TensorFlow. By default, it is tanh. These expressions act as learned features and can, therefore, take any value. With tanh activation, they lie in (−1, 1). Other suitable activations can also be used for them. On the other hand, the activations for the input, output, and forget gates correspond to the recurrent_activation argument in TensorFlow. These gates act as scales and are, therefore, intended to stay in (0, 1). Hence, their default is sigmoid. For most purposes, it's essential to keep recurrent_activation as sigmoid.
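To make these expressions concrete, here is a minimal NumPy sketch of a single cell step. It uses made-up values, a single cell (m = 1) with p = 2 features, and assumes the standard cell-state update c_t = f_t c_{t-1} + i_t \tilde{c}_t alongside the two expressions above; none of the names are part of any library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only: one cell (m = 1) in a layer, p = 2 input features.
m, p = 1, 2
rng = np.random.default_rng(0)

x_t = rng.normal(size=p)      # input time-step
h_prev = rng.normal(size=m)   # prior output of the layer's cells
c_prev = rng.normal(size=m)   # prior cell state

# One weight set per candidate/gate: w^{(x)} (p-vector), w^{(h)} (m-vector), bias (scalar).
w = {g: (rng.normal(size=p), rng.normal(size=m), rng.normal()) for g in "cifo"}

def linear(g):
    wx, wh, b = w[g]
    return wx @ x_t + wh @ h_prev + b

c_tilde = np.tanh(linear("c"))        # candidate: tanh keeps it in (-1, 1)
i_t = sigmoid(linear("i"))            # gates: sigmoid keeps them in (0, 1)
f_t = sigmoid(linear("f"))
o_t = sigmoid(linear("o"))

c_t = f_t * c_prev + i_t * c_tilde    # cell-state update (assumed standard form)
h_t = o_t * np.tanh(c_t)              # emitted output
print(h_t)
```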

Note: The recurrent_activation should be kept as sigmoid. The default activation is tanh, but it can be set to other activations, such as relu.
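As a minimal sketch of how these arguments appear in TensorFlow, the defaults can be written out explicitly (units=8 is an arbitrary illustrative layer size):

```python
import tensorflow as tf

# Defaults written out explicitly; units=8 is an arbitrary illustrative layer size.
layer = tf.keras.layers.LSTM(
    units=8,
    activation="tanh",               # candidate cell state and emitted output h_t
    recurrent_activation="sigmoid",  # input, forget, and output gates
)

# activation can be changed (e.g., to relu), but recurrent_activation is best kept sigmoid.
layer_relu = tf.keras.layers.LSTM(units=8, activation="relu")
```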

Parameters

Suppose an LSTM layer has m cells, that is, the layer size is equal to m. The cell mechanism describes one cell in an LSTM layer. The parameters involved in a cell are w_\cdot^{(h)}, w_\cdot^{(x)}, and b_\cdot, where \cdot is c, i, f, and o.

A cell takes in the prior outputs of all the cells in the layer. Given that the layer size is m, the prior output from the layer's cells is an m-vector h_{t-1}, and the weights w_\cdot^{(h)} are, therefore, also of length m.

The weight for the input time-step x_t is a p-vector, given that there are p features, that is, x_t \in \mathbb{R}^p. Lastly, the bias b_\cdot is a scalar.

Combining them for each of c, i, f, and o, the total number of parameters in a cell is 4(m + p + 1).

In the LSTM layer, there are m cells. Therefore, the total number of parameters in a layer is:

n_{\text{parameters}} = 4m(m + p + 1)
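As a quick sanity check of this formula, the sketch below builds a Keras LSTM layer with illustrative sizes (m = 8, p = 3, τ = 12) and compares its parameter count to 4m(m + p + 1):

```python
import tensorflow as tf

m, p, tau = 8, 3, 12                     # layer size, features, window size (illustrative)

inputs = tf.keras.Input(shape=(tau, p))  # each sample: tau time-steps of p features
outputs = tf.keras.layers.LSTM(m)(inputs)
model = tf.keras.Model(inputs, outputs)

expected = 4 * m * (m + p + 1)
print(model.count_params(), expected)    # both print 384
```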

Note: The number of parameters is independent of the number of time-steps the cell processes. That is, it is independent of the window size τ.

This implies that the parameter space doesn't grow if the window size is expanded to learn longer-term temporal patterns. While this might appear to be an advantage, in practice, performance deteriorates beyond a certain window size.

An LSTM layer has 4m(m + p + 1) parameters, where m is the size of the layer and p is the number of features in the input.

The number of LSTM parameters is independent of the sample window size.
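This can be confirmed by rebuilding the same layer for different (illustrative) window sizes; the parameter count does not change:

```python
import tensorflow as tf

m, p = 8, 3
for tau in (5, 50, 500):                  # different window sizes
    inputs = tf.keras.Input(shape=(tau, p))
    model = tf.keras.Model(inputs, tf.keras.layers.LSTM(m)(inputs))
    print(tau, model.count_params())      # 384 every time
```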

Iteration levels

A sample in an LSTM is a window of time-step observations. Due to this, its iteration levels, shown in the illustration below, go one level deeper than in MLPs (multi-layer perceptrons). In LSTMs, the iterations end at a time-step.
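As a small illustration of this extra level (all shapes below are made-up), an MLP sample is a p-vector, whereas an LSTM sample is itself a window of τ time-steps, so iterating over the data goes one level deeper:

```python
import numpy as np

n, tau, p = 4, 12, 3                 # samples, window size, features (illustrative)
X_mlp = np.zeros((n, p))             # MLP input: iteration stops at a sample
X_lstm = np.zeros((n, tau, p))       # LSTM input: each sample is a window of time-steps

for sample in X_lstm:                # level 1: samples
    for x_t in sample:               # level 2: time-steps within the sample's window
        pass                         # x_t is the p-vector consumed by the cell at time t
```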
