Understanding LSTM Activations and Stabilized Gradients

Explore the key role of activations in LSTMs and how they influence the network's ability to process and remember information over time.

Activations in LSTM

The activations below:

\tilde{c}_t = \text{tanh}(w_c^{(x)}x_t + w_c^{(h)}h_{t-1} + b_c)

for \tilde{c}_t and

h_t = o_t\text{tanh}(c_t)

for emitting h_t correspond to the activation argument of an LSTM layer in TensorFlow. By default, it is tanh. These expressions act as learned features and can, therefore, take any value. With tanh activation, they lie in (−1, 1). Other suitable activations can also be used for them. On the other hand, the activations for the input, output, and forget gates correspond to the recurrent_activation argument in TensorFlow. These gates act as scales and are, therefore, intended to stay in (0, 1). Hence, their default is sigmoid. For most purposes, it's essential to keep recurrent_activation as sigmoid.
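To make these expressions concrete, here is a minimal NumPy sketch of a single cell step. It uses made-up values, a single cell (m = 1) with p = 2 features, and assumes the standard cell-state update c_t = f_t c_{t-1} + i_t \tilde{c}_t alongside the two expressions above; none of the names are part of any library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values only: one cell (m = 1) in a layer, p = 2 input features.
m, p = 1, 2
rng = np.random.default_rng(0)

x_t = rng.normal(size=p)      # input time-step
h_prev = rng.normal(size=m)   # prior output of the layer's cells
c_prev = rng.normal(size=m)   # prior cell state

# One weight set per candidate/gate: w^{(x)} (p-vector), w^{(h)} (m-vector), bias (scalar).
w = {g: (rng.normal(size=p), rng.normal(size=m), rng.normal()) for g in "cifo"}

def linear(g):
    wx, wh, b = w[g]
    return wx @ x_t + wh @ h_prev + b

c_tilde = np.tanh(linear("c"))        # candidate: tanh keeps it in (-1, 1)
i_t = sigmoid(linear("i"))            # gates: sigmoid keeps them in (0, 1)
f_t = sigmoid(linear("f"))
o_t = sigmoid(linear("o"))

c_t = f_t * c_prev + i_t * c_tilde    # cell-state update (assumed standard form)
h_t = o_t * np.tanh(c_t)              # emitted output
print(h_t)
```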

Note: The recurrent_activation should be kept as sigmoid. The default activation is tanh, but it can be set to other activations, such as relu.
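As a minimal sketch of how these arguments appear in TensorFlow, the defaults can be written out explicitly (units=8 is an arbitrary illustrative layer size):

```python
import tensorflow as tf

# Defaults written out explicitly; units=8 is an arbitrary illustrative layer size.
layer = tf.keras.layers.LSTM(
    units=8,
    activation="tanh",               # candidate cell state and emitted output h_t
    recurrent_activation="sigmoid",  # input, forget, and output gates
)

# activation can be changed (e.g., to relu), but recurrent_activation is best kept sigmoid.
layer_relu = tf.keras.layers.LSTM(units=8, activation="relu")
```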

Parameters

Suppose an LSTM layer has m cells, that is, the layer size is equal to m. The cell mechanism describes one cell in an LSTM layer. The parameters involved in a cell are w_\cdot^{(h)}, w_\cdot^{(x)}, and b_\cdot, where \cdot is c, i, f, and o.

A cell takes in the prior outputs of all the cells in the layer. Given that the layer size is m, the prior output from the layer's cells is an m-vector h_{t-1}, and the weights w_\cdot^{(h)} are, therefore, also of length m.

The weight for the input time-step x_t is a p-vector, given that there are p features, that is, x_t \in \mathbb{R}^p. Lastly, the bias b_\cdot is a scalar.

Combining them for each of c, i, f, and o, the total number of parameters in a cell is 4(m + p + 1).

In the LSTM layer, there are m cells. Therefore, the total number of parameters in a layer is:

n_{\text{parameters}} = 4m(m + p + 1)
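As a quick sanity check of this formula, the sketch below builds a Keras LSTM layer with illustrative sizes (m = 8, p = 3, τ = 12) and compares its parameter count to 4m(m + p + 1):

```python
import tensorflow as tf

m, p, tau = 8, 3, 12                     # layer size, features, window size (illustrative)

inputs = tf.keras.Input(shape=(tau, p))  # each sample: tau time-steps of p features
outputs = tf.keras.layers.LSTM(m)(inputs)
model = tf.keras.Model(inputs, outputs)

expected = 4 * m * (m + p + 1)
print(model.count_params(), expected)    # both print 384
```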

Note: The number of parameters is independent of the number of time-steps the cell processes. That is, it is independent of the window size τ.

This implies that the parameter space doesn't grow if the window size is expanded to learn longer-term temporal patterns. While this might appear to be an advantage, in practice, performance deteriorates beyond a certain window size.

An LSTM layer has 4m(m + p + 1) parameters, where m is the size of the layer and p is the number of features in the input.

The number of LSTM parameters is independent of the sample window size.
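This can be confirmed by rebuilding the same layer for different (illustrative) window sizes; the parameter count does not change:

```python
import tensorflow as tf

m, p = 8, 3
for tau in (5, 50, 500):                  # different window sizes
    inputs = tf.keras.Input(shape=(tau, p))
    model = tf.keras.Model(inputs, tf.keras.layers.LSTM(m)(inputs))
    print(tau, model.count_params())      # 384 every time
```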

Iteration levels

A sample in an LSTM is a window of time-step observations. Due to this, its iteration levels, shown in the illustration below, go one level deeper than in MLPs (multi-layer perceptrons). In LSTMs, the iterations end at a time-step.
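As a small illustration of this extra level (all shapes below are made-up), an MLP sample is a p-vector, whereas an LSTM sample is itself a window of τ time-steps, so iterating over the data goes one level deeper:

```python
import numpy as np

n, tau, p = 4, 12, 3                 # samples, window size, features (illustrative)
X_mlp = np.zeros((n, p))             # MLP input: iteration stops at a sample
X_lstm = np.zeros((n, tau, p))       # LSTM input: each sample is a window of time-steps

for sample in X_lstm:                # level 1: samples
    for x_t in sample:               # level 2: time-steps within the sample's window
        pass                         # x_t is the p-vector consumed by the cell at time t
```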
