...

Implementing the Language Model

Learn how to define the layers and model in the implementation.

First, we’ll discuss the hyperparameters that are used for the LSTM and their effects.

Thereafter, we’ll cover the parameters (weights and biases) required to implement the LSTM and see how they are used to define the operations taking place within it. Next, we’ll look at how data is fed to the LSTM sequentially and how the model is trained. Finally, we’ll investigate how the learned model can be used to output predictions, which are essentially bigrams that will eventually add up to a meaningful story.

Defining the TextVectorization layer

We have already discussed the TextVectorization layer, and we’ll be using the same text vectorization mechanism to tokenize text here. In summary, the TextVectorization layer provides us with a convenient way to integrate text tokenization (i.e., converting strings into a list of tokens that are represented by integer IDs) into the model as a layer.

Here, we’ll define a TextVectorization layer to convert the sequences of n-grams to sequences of integer IDs:

import tensorflow as tf
import tensorflow.keras.layers as layers
import tensorflow.keras.models as models

# The vectorization layer that will convert string bigrams to IDs
text_vectorizer = tf.keras.layers.TextVectorization(
    max_tokens=n_vocab, standardize=None,
    split=None, input_shape=(window_size,)
)

Note that we are defining several important arguments: max_tokens (the size of the vocabulary), standardize=None so that no text preprocessing is performed, split=None so that no splitting is performed (each n-gram is already a token), and finally the input_shape argument to inform the layer that the input will be a batch of sequences of n-grams. With that, we have to train the text vectorization layer to recognize the available n-grams and map them to unique IDs. We can do this by simply passing our training tf.data pipeline to the layer’s adapt() method so that it can learn the n-grams.

# Learn the n-gram vocabulary from the training pipeline. If adapt() complains
# about the (inputs, targets) tuples in the pipeline, pass only the inputs
# instead, e.g., text_vectorizer.adapt(train_ds.map(lambda x, y: x))
text_vectorizer.adapt(train_ds)

Next, let’s print the words in the vocabulary to see what this layer has learned:

text_vectorizer.get_vocabulary()[:10]

This will output:

['', '[UNK]', 'e ', 'he', ' t', 'th', 'd ', ' a', ', ', ' h']
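To see what the adapted layer does to its inputs, here is a small illustrative example (not part of the original code) that passes a short batch of bigrams, taken from the vocabulary printed above, through the layer. The sequence length here is just for illustration and need not match window_size, and the exact IDs depend on the adapted vocabulary:

# Illustrative only: map a batch containing one short sequence of bigrams to IDs
sample_bigrams = tf.constant([['th', 'e ', ' t', 'he', 'd ']])
print(text_vectorizer(sample_bigrams))
# Based on the vocabulary printed above, this would produce IDs such as [[5 2 4 3 6]]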

Once the TextVectorization layer is trained, we have to modify our training, validation, and testing data pipelines slightly. Remember that our data pipelines output sequences of n-gram strings as inputs and targets. We need to convert the target sequences to sequences of n-gram IDs so that a loss can be computed. For that, we’ll simply pass the targets in the datasets through the text_vectorizer layer using the tf.data.Dataset.map() functionality:

# Convert the target n-gram strings to integer IDs; the inputs remain strings
# (the test pipeline can be mapped in the same way when it is used)
train_ds = train_ds.map(lambda x, y: (x, text_vectorizer(y)))
valid_ds = valid_ds.map(lambda x, y: (x, text_vectorizer(y)))
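As an optional sanity check (a quick sketch, not part of the original pipeline), we can peek at a single batch to confirm that the inputs are still strings while the targets are now integer IDs; the exact shapes depend on how the pipeline was batched:

# Inspect one batch from the mapped training pipeline
for inputs, targets in train_ds.take(1):
    print(inputs.shape, inputs.dtype)    # e.g., (batch_size, window_size), tf.string
    print(targets.shape, targets.dtype)  # e.g., (batch_size, window_size), tf.int64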

Next, we’ll look at the LSTM-based ...