Understanding Long Short-Term Memory Networks
Get a basic understanding of LSTMs.
In this lesson, we’ll first explain how an LSTM cell operates. In addition to the hidden state, we’ll see that a gating mechanism controls the flow of information inside the cell. Then, we’ll work through a detailed example and see how the gates and states contribute at various stages to produce the desired output. Finally, we’ll compare an LSTM against a standard RNN to see how the two differ.
What is an LSTM?
LSTMs can be seen as a more complex and capable family of RNNs. Though an LSTM is a complicated beast, the underlying principles are the same as those of RNNs: they process a sequence of items one input at a time, in sequential order. An LSTM is mainly composed of five different components (tied together in the equations after this list):
Cell state: This is the internal cell state (that is, memory) of an LSTM cell.
Hidden state: This is the external hidden state exposed to other layers and used to calculate predictions.
Input gate: This determines how much of the current input is read into the cell state.
Forget gate: This determines how much of the previous cell state is sent into the current cell state.
Output gate: This determines how much of the cell state is output into the hidden state.
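To see how these five components fit together, one widely used formulation (the notation here is a common convention rather than this lesson’s exact equations, which are discussed later) combines them as

$$
\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t,
\qquad
\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t),
$$

where $\mathbf{f}_t$, $\mathbf{i}_t$, and $\mathbf{o}_t$ are the forget, input, and output gate values, $\tilde{\mathbf{c}}_t$ is a candidate state computed from the current input and the previous hidden state, $\mathbf{c}_t$ is the cell state, $\mathbf{h}_t$ is the hidden state, and $\odot$ denotes element-wise multiplication.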
We can view a standard RNN through the same cell abstraction: the cell outputs some state (computed with a nonlinear activation function) that depends on the previous cell state and the current input. However, in RNNs, the cell state is overwritten with every incoming input. This behavior is quite undesirable for storing long-term dependencies.
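As a point of contrast, here is a minimal NumPy sketch of a plain RNN cell step (the weight names and sizes are purely illustrative): because the state is recomputed from scratch at every step, there is no mechanism to protect old information from being overwritten.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a plain RNN: the whole state is rewritten by every input."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Hypothetical sizes: 4-dimensional inputs, 3-dimensional hidden state.
rng = np.random.default_rng(0)
W_xh, W_hh, b_h = rng.normal(size=(3, 4)), rng.normal(size=(3, 3)), np.zeros(3)

h = np.zeros(3)
for x_t in rng.normal(size=(5, 4)):          # a sequence of 5 inputs
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)    # h is overwritten at every step
```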
LSTMs can decide when to add, update, or forget information stored in each neuron in the cell state. In other words, LSTMs are equipped with a mechanism to keep the cell state unchanged (if warranted for better performance), giving them the ability to store long-term dependencies.
This is achieved by introducing a gating mechanism. LSTMs possess a gate for each operation the cell needs to perform. The gates output continuous values (typically produced by a sigmoid function) between 0 and 1, where 0 means no information flows through the gate and 1 means all the information flows through it. An LSTM uses one such gate value for each neuron in the cell. As explained above, these gates control the following (a small sketch of this flow-control idea follows the list):
How much of the current input is written to the cell state (input gate)
How much information is forgotten from the previous cell state (forget gate)
How much information is output into the final hidden state from the cell state (output gate)
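To make the idea of a gate concrete, here is a tiny NumPy sketch (not a full LSTM update, just the flow-control mechanism): a gate is a sigmoid output that multiplies a piece of information element-wise, so values near 0 block it and values near 1 let it through.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical per-neuron gate activations and a piece of information to control.
gate = sigmoid(np.array([-6.0, 0.0, 6.0]))   # roughly [0.0, 0.5, 1.0]
info = np.array([2.0, 2.0, 2.0])

flow = gate * info   # roughly [0, 1, 2]: blocked, halved, passed through
```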
Data functionality in LSTM models
The figure below illustrates this functionality for a hypothetical scenario. Each gate decides how much of various data (for example, the current input, the previous hidden state, or the previous cell state) flows into the states (that is, the final hidden state or the cell state). The thickness of each line represents how much information flows to or from that gate. For example, in this figure, we can see that the input gate allows more from the current input than from the previous final hidden state, whereas the forget gate allows more from the previous final hidden state than from the current input:
LSTMs in more detail
Here, we’ll walk through the actual mechanism of LSTMs. We’ll first briefly discuss the overall view of an LSTM cell and then discuss each of the computations performed within an LSTM cell, along with a text generation example.
As we discussed earlier, LSTMs have a gating mechanism composed of the following three gates:
Input gate: A gate that outputs values between 0 (the current input is not written to the cell state) and 1 (the current input is fully written to the cell state). Sigmoid activation is used to squash the output to between 0 and 1.
Forget gate: A sigmoidal gate that outputs values between 0 (the previous cell state is fully forgotten for calculating the current cell state) and 1 (the previous cell state is fully read in when calculating the current cell state).
Output gate: A sigmoidal gate that outputs values between 0 (the current cell state is fully discarded for calculating the final state) and 1 (the current cell state is fully used when calculating the final hidden state).
This can be shown in the figure below. This is a very high-level diagram, and some details have been omitted to avoid clutter. We present the LSTM both with and without loops to improve our understanding. The figure on the right-hand side depicts an LSTM with loops, and the one on the left-hand side shows the same LSTM with the loops unfolded, so that no loops are present in the model:
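Putting the three gates together, the following is a minimal NumPy sketch of a single LSTM cell step under the standard formulation (the parameter names and dictionary layout are illustrative; in practice a framework’s LSTM layer handles these details):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: gates decide what to write, what to keep, and what to expose."""
    W_i, U_i, b_i = params["input"]    # input gate parameters
    W_f, U_f, b_f = params["forget"]   # forget gate parameters
    W_o, U_o, b_o = params["output"]   # output gate parameters
    W_c, U_c, b_c = params["cell"]     # candidate cell state parameters

    i_t = sigmoid(W_i @ x_t + U_i @ h_prev + b_i)       # how much of the input to write
    f_t = sigmoid(W_f @ x_t + U_f @ h_prev + b_f)       # how much of the old cell state to keep
    o_t = sigmoid(W_o @ x_t + U_o @ h_prev + b_o)       # how much of the cell state to expose
    c_tilde = np.tanh(W_c @ x_t + U_c @ h_prev + b_c)   # candidate computed from the input

    c_t = f_t * c_prev + i_t * c_tilde   # cell state: keep some memory, add some new input
    h_t = o_t * np.tanh(c_t)             # hidden state exposed to other layers
    return h_t, c_t
```

Note that if the forget gate is close to 1 and the input gate is close to 0, the cell state passes through essentially unchanged, which is exactly the mechanism that lets an LSTM retain long-term dependencies.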
Language model example to understand LSTMs
Now, to get a better understanding of LSTMs, let’s consider a language modeling example. We’ll discuss the actual update rules and equations side by side with the example to ground our understanding of LSTMs.
Let’s consider an example of generating text starting from the following sentence:
John gave Mary a puppy.
The story that we output should be about John, Mary, and the puppy. Let’s assume our LSTM outputs two sentences following the given sentence:
John gave Mary a puppy. _____________________. _____________________.
The following is the output given by our LSTM:
John gave Mary a puppy. It barks very loudly. They named it Luna.
We’re still far from LSTMs outputting realistic phrases such as these. However, LSTMs can learn relationships, such as those between nouns and pronouns. For example, “it” is related to “puppy,” and “they” to “John” and “Mary.” The LSTM should also learn the relationship between a noun/pronoun and its verb. For example, for “it,” the verb should have an “s” at the end. We illustrate these relationships/dependencies in the figure below. As we can see, both long-term (for example, “Luna” → “puppy”) and short-term (for example, “It” → “barks”) dependencies are present in this text. The solid arrows depict links between nouns and pronouns, and the dashed arrows show links between nouns/pronouns and verbs:
Now, let’s consider how LSTMs, using their various operations, can model such relationships and dependencies to output sensible text given a starting sentence.
The input gate