Implementing the Language Model
Learn how to define the layers and model in the implementation.
First, we’ll discuss the hyperparameters that are used for the LSTM and their effects.
Thereafter, we’ll discuss the parameters (weights and biases) required to implement the LSTM. We’ll then discuss how these parameters are used to write the operations taking place within the LSTM. This will be followed by learning how we’ll sequentially feed data to the LSTM. Next, we’ll discuss how to train the model. Finally, we’ll investigate how we can use the learned model to output predictions, which are essentially bigrams that will eventually add up to a meaningful story.
Defining the TextVectorization layer
We discussed the TextVectorization layer previously. We'll be using the same text vectorization mechanism to tokenize text. In summary, the TextVectorization layer provides us with a convenient way to integrate text tokenization (i.e., converting strings into a list of tokens that are represented by integer IDs) into the model as a layer.
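To make this concrete, here is a minimal, self-contained sketch (not part of the original code; the toy bigrams are made up) of how such a layer maps strings to IDs:
import tensorflow as tf

# Toy example: a vectorizer over a handful of made-up bigrams
toy_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=10, standardize=None, split=None)
toy_vectorizer.adapt(tf.constant(["he", "e ", "th", "he"]))  # builds the vocabulary
print(toy_vectorizer(tf.constant(["he", "th", "xx"])))
# Each known bigram gets its own integer ID; "xx" was never seen,
# so it maps to the [UNK] ID (1)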
Here, we’ll define a TextVectorization layer to convert the sequences of n-grams to sequences of integer IDs:
import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models

# The vectorization layer that will convert string bigrams to IDs
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=n_vocab, standardize=None,
    split=None, input_shape=(window_size,)
)
Note that we are defining several important arguments: max_tokens (the size of the vocabulary), the standardize argument set to None so that no text preprocessing is performed, the split argument set to None so that no splitting is performed, and finally, the input_shape argument to inform the layer that the input will be a batch of sequences of n-grams. With that, we have to train the text vectorization layer to recognize the available n-grams and map them to unique IDs. We can simply pass our training tf.data pipeline to this layer to learn the n-grams.
text_vectorizer.adapt(train_ds)
Next, let’s print the first few tokens in the vocabulary to see what this layer has learned:
text_vectorizer.get_vocabulary()[:10]
This will output:
['', '[UNK]', 'e ', 'he', ' t', 'th', 'd ', ' a', ', ', ' h']
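As a side note, this same vocabulary can later be used to map predicted IDs back to their bigram strings when we generate text. Here's a quick sketch (the id_to_bigram name is only illustrative, not from the original code):
# Build an ID -> bigram lookup from the learned vocabulary
id_to_bigram = dict(enumerate(text_vectorizer.get_vocabulary()))
print(id_to_bigram[2])  # e.g. 'e '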
Once the TextVectorization layer is trained, we have to modify our training, validation, and testing data pipelines slightly. Remember that our data pipelines output sequences of n-gram strings as both inputs and targets. We need to convert the target sequences to sequences of n-gram IDs so that a loss can be computed. For that, we’ll simply pass the targets in the datasets through the text_vectorizer layer using the tf.data.Dataset.map() functionality:
train_ds = train_ds.map(lambda x, y: (x, text_vectorizer(y)))
valid_ds = valid_ds.map(lambda x, y: (x, text_vectorizer(y)))
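To verify that the mapping did what we expect, we can peek at a single batch. This check is just a sketch and assumes the pipelines are batched as before:
# Inspect one batch: inputs are still strings, targets are now integer IDs
for inputs, targets in train_ds.take(1):
    print(inputs.dtype, inputs.shape)    # tf.string, (batch_size, window_size)
    print(targets.dtype, targets.shape)  # tf.int64, (batch_size, window_size)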
Next, we’ll look at the LSTM-based ...