...


LSTM Variants and Convolutions for Text

Learn about two popular variants of the single-layer LSTM network: stacked and bidirectional LSTMs.

RNNs are extremely useful when it comes to handling sequential datasets. A simple model can effectively learn to generate text based on what it learned from the training dataset.

Over the years, there have been a number of enhancements in the way we model and use RNNs. In this section, we’ll discuss two widely used variants of the single-layer LSTM network we discussed earlier: stacked and bidirectional LSTMs.

Stacked LSTMs

We are well aware of how the depth of a neural network helps it learn complex and abstract concepts when it comes to computer vision tasks. Along the same lines, a stacked LSTM architecture, which has multiple layers of LSTMs stacked one after the other, has been shown to give considerable improvements. Stacked LSTMs were first presented by Graves et al. in their work "Speech Recognition with Deep Recurrent Neural Networks" (Graves, Alex, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. "Speech Recognition with Deep Recurrent Neural Networks." arXiv. https://arxiv.org/abs/1303.5778). They highlight the fact that depth (multiple layers of RNNs) has a greater impact on performance compared to the number of units per layer.

Architecture of a stacked LSTM

Though there isn’t any theoretical proof to explain this performance gain, empirical results help us understand the impact. These enhancements can be attributed to the model’s capacity to learn complex features and even abstract representations of inputs. Since there is a time component associated with LSTMs and RNNs in general, deeper networks learn the ability to operate at different time scales as well (Pascanu, Razvan, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. "How to Construct Deep Recurrent Neural Networks." arXiv:1312.6026. https://arxiv.org/abs/1312.6026).
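Before we modify our model-building utility, here is a minimal sketch of what "stacking" means in Keras terms (the unit counts, feature size, and output size below are illustrative placeholders, not values from the course). The key detail is that any LSTM layer that feeds another LSTM must set return_sequences=True, so the next layer receives an output for every timestep rather than only the final state:

import tensorflow as tf

# Minimal two-layer stacked LSTM; 128/64 units, 32 input features, and the
# 10-way output are placeholders for illustration only.
toy_stacked_model = tf.keras.Sequential([
    tf.keras.layers.LSTM(128, return_sequences=True,
                         input_shape=(None, 32)),    # (timesteps, features)
    tf.keras.layers.LSTM(64, return_sequences=True), # consumes the full sequence
    tf.keras.layers.Dense(10)
])
toy_stacked_model.summary()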

As we are using the high-level Keras API, we can easily extend the architecture we used in the previous section to add additional LSTM layers. The following snippet modifies the build_model function to do just that:

def build_model(vocab_size, embedding_dim, rnn_units, batch_size, is_bidirectional=False):
    """
    Utility to create a model object.

    Parameters:
        vocab_size: number of unique characters
        embedding_dim: size of the embedding vector. This is typically a
            power of 2, i.e., 64, 128, 256, and so on
        rnn_units: number of LSTM units to be used
        batch_size: batch size for training the model
        is_bidirectional: if True, the first recurrent layer is wrapped in a
            bidirectional wrapper

    Returns:
        tf.keras model object
    """
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                        batch_input_shape=[batch_size, None]))
    if is_bidirectional:
        model.add(tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(rnn_units, return_sequences=True, stateful=True,
                                 recurrent_initializer='glorot_uniform')))
    else:
        model.add(tf.keras.layers.LSTM(rnn_units, return_sequences=True,
                                       stateful=True,
                                       recurrent_initializer='glorot_uniform'))
    # Second LSTM layer stacked on top of the first recurrent layer
    model.add(tf.keras.layers.LSTM(rnn_units, return_sequences=True,
                                   stateful=True,
                                   recurrent_initializer='glorot_uniform'))
    model.add(tf.keras.layers.Dense(vocab_size))
    return model
  • We initialize a sequential model using tf.keras.Sequential().

  • We add an embedding layer to the model using tf.keras.layers.Embedding(). It takes vocab_size, embedding_dim, and batch_input_shape as parameters. The batch_input_shape is set to [batch_size, None] to allow for variable sequence lengths in the input.

  • We then add the LSTM layers. If is_bidirectional is True, the first recurrent layer is a bidirectional LSTM added using tf.keras.layers.Bidirectional(); if it is False, it is a plain LSTM layer. A second LSTM layer is stacked on top in either case. Finally, a dense layer of size vocab_size is added.

The dataset, training loop, and even the inference utilities remain as-is. For brevity, we have skipped presenting those code snippets again. We will discuss the is_bidirectional argument that we introduced here shortly.
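As a quick sanity check, this is how the deeper model might be instantiated (the hyperparameter values below are illustrative placeholders, not the exact ones used for training in the course):

# Illustrative hyperparameters only; the real values depend on the corpus
# and the character vocabulary built earlier.
VOCAB_SIZE = 100      # number of unique characters
EMBEDDING_DIM = 256
RNN_UNITS = 1024
BATCH_SIZE = 64

stacked_lstm = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE)
bi_lstm = build_model(VOCAB_SIZE, EMBEDDING_DIM, RNN_UNITS, BATCH_SIZE,
                      is_bidirectional=True)
stacked_lstm.summary()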

# Greedy decoding
print('Greedy Decoding')
print(generate_text(
    model, context_string=u"It was in July, 1805", num_generate=100, mode="greedy"))
print()

print('Sampled @ 0.3')
# Sampled decoding with different temperature settings
print(generate_text(
    model, context_string=u"It was in July, 1805", num_generate=100, mode="sampling", temperature=0.3))
print()

print('Sampled @ 0.9')
print(generate_text(
    model, context_string=u"It was in July, 1805", num_generate=100, mode="sampling", temperature=0.9))
Generating text using the deeper LSTM
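The generate_text utility itself was defined earlier in the chapter. As a reminder of what the mode and temperature arguments control, here is a minimal sketch of such a utility; it assumes the char2idx/idx2char lookups built earlier and a model rebuilt with batch_size=1 for inference, and is not the exact implementation used in the course:

def generate_text(model, context_string, num_generate=100,
                  mode="greedy", temperature=1.0):
    # Vectorize the seed text with the char2idx mapping built earlier.
    input_ids = tf.expand_dims([char2idx[c] for c in context_string], 0)
    generated = []
    model.reset_states()                      # clear the stateful LSTM state
    for _ in range(num_generate):
        logits = model(input_ids)             # shape: (1, seq_len, vocab_size)
        logits = logits[:, -1, :]             # keep only the last timestep
        if mode == "greedy":
            # Always pick the most likely next character.
            next_id = int(tf.argmax(logits, axis=-1)[0])
        else:
            # Temperature scaling: <1 sharpens the distribution (safer text),
            # >1 flattens it (more surprising text).
            next_id = int(tf.random.categorical(
                logits / temperature, num_samples=1)[0, 0])
        generated.append(idx2char[next_id])
        input_ids = tf.expand_dims([next_id], 0)
    return context_string + "".join(generated)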

Now, let’s see how the results look for this deeper LSTM-based language model. The code output demonstrates the results from this model.

We can clearly see that the generated text picks up the writing style of the book, along with capitalization, punctuation, and other aspects, better than the outputs from the earlier single-layer model. This highlights some of the advantages we discussed regarding deeper RNN architectures.

Bidirectional LSTMs

The second variant that's very widely used nowadays is the bidirectional LSTM. We have already discussed how LSTMs, and RNNs in general, condition their outputs on previous timesteps. When it comes to text, or any sequence data, this means that the LSTM can make use of past context to predict future timesteps. While this is a very useful property, it is not the best we can achieve. Let's illustrate why this is a limitation through an example:

Looking at both past and future context windows for a given word

As is evident from this example, without looking at what is to the right of the target word, "Teddy," the model cannot pick up the context properly. To handle such scenarios, bidirectional LSTMs were introduced. The idea behind them is straightforward: a bidirectional LSTM (or BiLSTM) is a combination of two LSTM layers that work simultaneously. The first is the usual forward LSTM, which takes the ...