LSTM Variants and Convolutions for Text
Learn about two popular variants of the single-layer LSTM network: stacked and bidirectional LSTMs.
RNNs are extremely useful when it comes to handling sequential datasets. A simple model can effectively learn to generate text based on what it learned from the training dataset.
Over the years, there have been a number of enhancements in the way we model and use RNNs. In this section, we’ll discuss two widely used variants of the single-layer LSTM network we discussed earlier: stacked and bidirectional LSTMs.
Stacked LSTMs
We are well aware of how the depth of a neural network helps it learn complex and abstract concepts when it comes to computer vision tasks. Along the same lines, a stacked LSTM architecture, which has multiple layers of LSTMs stacked one after the other, has been shown to give considerable improvements. Stacked LSTMs were first presented by Graves et al. in their work “Speech Recognition with Deep Recurrent Neural Networks.”
Though there isn’t any theoretical proof to explain this performance gain, empirical results help us understand the impact. These enhancements can be attributed to the model’s capacity to learn complex features and even abstract representations of inputs. Since there is a time component associated with LSTMs and RNNs in general, deeper networks gain the ability to operate at different time scales as well.
As we are using the high-level Keras API, we can easily extend the architecture we used in the previous section to add additional LSTM layers. The following snippet modifies the build_model function to do just that:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size,
                is_bidirectional=False):
    """Utility to create a model object.

    Parameters:
        vocab_size: number of unique characters
        embedding_dim: size of embedding vector. This is typically in
            powers of 2, i.e. 64, 128, 256 and so on
        rnn_units: number of LSTM units to be used
        batch_size: batch size for training the model
    Returns:
        tf.keras model object
    """
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Embedding(vocab_size, embedding_dim,
                                        batch_input_shape=[batch_size, None]))
    if is_bidirectional:
        model.add(tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(rnn_units, return_sequences=True,
                                 stateful=True,
                                 recurrent_initializer='glorot_uniform')))
    else:
        model.add(tf.keras.layers.LSTM(rnn_units, return_sequences=True,
                                       stateful=True,
                                       recurrent_initializer='glorot_uniform'))
    model.add(tf.keras.layers.LSTM(rnn_units, return_sequences=True,
                                   stateful=True,
                                   recurrent_initializer='glorot_uniform'))
    model.add(tf.keras.layers.Dense(vocab_size))
    return model
Line 14: We initialize a sequential model using tf.keras.Sequential().
Lines 15–16: We add an embedding layer to the model using tf.keras.layers.Embedding(). It takes the vocab_size, embedding_dim, and batch_input_shape as parameters. The batch_input_shape is set to [batch_size, None] to allow for variable sequence lengths in the input.
Lines 18–28: We add the LSTM layers to the model. If is_bidirectional is True, a bidirectional LSTM layer is added using tf.keras.layers.Bidirectional(); if it is False, two LSTM layers are added sequentially. Finally, a dense layer of size vocab_size is added.
The dataset, training loop, and even the inference utilities remain as-is; for brevity, we have skipped presenting those code snippets again. We will discuss the is_bidirectional argument that we introduce here shortly.
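Before moving on to generation, here is a minimal sketch of how the stacked model defined above might be instantiated and compiled. The hyperparameter values (vocab_size, embedding_dim, rnn_units, batch_size), the Adam optimizer, and the from-logits sparse categorical cross-entropy loss are illustrative assumptions rather than the exact settings used in the earlier training code.

import tensorflow as tf

# Illustrative hyperparameters -- assumed values, not necessarily those used earlier
vocab_size = 65        # number of unique characters in the corpus
embedding_dim = 256
rnn_units = 1024
batch_size = 64

# is_bidirectional defaults to False, so two LSTM layers are stacked
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size)

# The final Dense layer emits raw logits, so the loss is computed from logits
def loss_fn(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True)

model.compile(optimizer='adam', loss=loss_fn)
model.summary()

Once trained on the prepared character dataset, the same generate_text utility from earlier can be used to sample from this deeper model: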
# Greedy decoding
print('Greedy Decoding')
print(generate_text(model,
                    context_string=u"It was in July, 1805",
                    num_generate=100,
                    mode="greedy"))
print()

print('Sampled @ 0.3')
# Sampled decoding with different temperature settings
print(generate_text(model,
                    context_string=u"It was in July, 1805",
                    num_generate=100,
                    mode="sampling",
                    temperature=0.3))
print()

print('Sampled @ 0.9')
print(generate_text(model,
                    context_string=u"It was in July, 1805",
                    num_generate=100,
                    mode="sampling",
                    temperature=0.9))
Now, let’s see how the results look for this deeper LSTM-based language model. The output of the snippet above shows the text it generates.
We can clearly see how the generated text picks up the writing style of the book, including capitalization, punctuation, and other aspects, better than the outputs from the single-layer model shown earlier. This highlights some of the advantages we discussed regarding deeper RNN architectures.
Bidirectional LSTMs
The second variant that’s very widely used nowadays is the bidirectional LSTM. We have already discussed how LSTMs, and RNNs in general, condition their outputs on previous timesteps. When it comes to text, or any sequence data, this means that the LSTM can make use of past context to predict future timesteps. While this is a very useful property, it is not the best we can achieve. Let’s illustrate why this is a limitation through an example: consider the sentences “Teddy bears are on sale” and “Teddy Roosevelt was a great President.” Reading left to right and stopping at the word “Teddy,” a model has no way of knowing whether the word refers to a toy or a person.
As is evident from this example, without looking at what is to the right of the target word, “Teddy,” the model would not pick up the context properly. To handle such scenarios, bidirectional LSTMs were introduced. The idea behind them is pretty simple and straightforward. A bidirectional LSTM (or BiLSTM) is a combination of two LSTM layers that work simultaneously: the first is the usual forward LSTM, which processes the input sequence in its original order, while the second is a backward LSTM, which processes the sequence in reverse. The outputs of the two layers are then combined, typically by concatenation, at each timestep.
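To make the idea concrete, here is a minimal, self-contained sketch of a bidirectional LSTM layer in Keras. The vocabulary size, embedding dimension, and number of units are illustrative assumptions, not the configuration used elsewhere in this lesson.

import tensorflow as tf

# Assumed toy sizes for illustration only
VOCAB_SIZE = 1000
EMBEDDING_DIM = 64
RNN_UNITS = 128

# Variable-length sequences of integer token ids
inputs = tf.keras.Input(shape=(None,), dtype='int32')
x = tf.keras.layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)

# The Bidirectional wrapper runs one LSTM left-to-right and another
# right-to-left, then combines their outputs at every timestep,
# so each timestep sees both past and future context.
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(RNN_UNITS, return_sequences=True))(x)

# Per-timestep logits over the vocabulary
outputs = tf.keras.layers.Dense(VOCAB_SIZE)(x)

model = tf.keras.Model(inputs, outputs)
model.summary()   # the BiLSTM output dimension is 2 * RNN_UNITS per timestep

By default, Keras concatenates the forward and backward outputs, which is why the wrapped layer's output dimension doubles.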