Introduction: Understanding Long Short-Term Memory Networks

Get an overview of long short-term memory networks.

How long short-term memory networks work

In this chapter, we’ll discuss the fundamentals behind a more advanced RNN variant known as long short-term memory networks (LSTMs). Here, we’ll focus on understanding the theory behind LSTMs so we can discuss their implementation later. LSTMs are widely used in many sequential tasks (including stock market prediction, language modeling, and machine translation) and have proven to perform better than older sequential models (for example, standard RNNs), especially given the availability of large amounts of data. LSTMs are designed to avoid the vanishing gradient problem.

The main practical limitation posed by the vanishing gradient is that it prevents the model from learning long-term dependencies. By avoiding this problem, LSTMs can retain memory far longer than ordinary RNNs (for hundreds of time steps). Whereas an RNN maintains only a single hidden state that is overwritten at every time step, and therefore cannot choose which information to keep and which to drop, an LSTM has additional parameters (its gates) that give it control over what to store in memory and what to discard at each step.
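To make the contrast concrete, here is a minimal NumPy sketch of a single vanilla RNN step next to a single LSTM step. The weight names, shapes, and the tiny demo at the end are our own illustrative choices, not code from this chapter; the gate equations themselves are the standard LSTM formulation we’ll walk through later.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_step(x, h_prev, W_xh, W_hh, b_h):
    # A vanilla RNN has a single hidden state that is overwritten at every step.
    return np.tanh(x @ W_xh + h_prev @ W_hh + b_h)

def lstm_step(x, h_prev, c_prev, params):
    # Gates decide what to discard (forget), what to write (input),
    # and what to expose (output) from the cell state.
    W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o = params
    z = np.concatenate([x, h_prev])       # current input and previous hidden state
    f = sigmoid(z @ W_f + b_f)            # forget gate
    i = sigmoid(z @ W_i + b_i)            # input gate
    c_tilde = np.tanh(z @ W_c + b_c)      # candidate cell state
    c = f * c_prev + i * c_tilde          # selectively keep old memory, add new memory
    o = sigmoid(z @ W_o + b_o)            # output gate
    h = o * np.tanh(c)                    # exposed hidden state
    return h, c

# Tiny demo with random weights (hypothetical sizes).
rng = np.random.default_rng(0)
x_dim, h_dim = 4, 3
params = [rng.standard_normal((x_dim + h_dim, h_dim)) * 0.1 for _ in range(4)] + \
         [np.zeros(h_dim) for _ in range(4)]
h, c = np.zeros(h_dim), np.zeros(h_dim)
for x in rng.standard_normal((5, x_dim)):  # a sequence of five time steps
    h, c = lstm_step(x, h, c, params)
```

Because the cell state `c` is updated by element-wise multiplication and addition rather than being rewritten wholesale, the LSTM can carry information across many steps when the forget gate stays close to one.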

Specifically, we’ll discuss what an LSTM is at a very high level and how its functionality allows it to store long-term dependencies. Then, we’ll dive into the underlying mathematical framework governing LSTMs and walk through an example to highlight why each computation matters. We’ll also compare LSTMs to vanilla RNNs and see that LSTMs have a much more sophisticated architecture that lets them surpass vanilla RNNs in sequential tasks. We’ll then revisit the vanishing gradient problem, illustrate it with an example, and see exactly how LSTMs solve it.

Afterward, we’ll discuss several techniques that have been introduced to improve the predictions produced by a standard LSTM (for example, improving the quality and variety of generated text in a text generation task). One such technique is to generate several candidate predictions at once, rather than predicting one step at a time, which can improve the quality of the generated output. We’ll also look at bidirectional LSTMs (BiLSTMs), an extension of the standard LSTM that processes a sequence in both directions and can therefore capture the patterns in a sequence better than a standard LSTM; a rough sketch of the idea follows below.
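As a preview of the bidirectional idea, here is a hedged sketch that reuses the `lstm_step()` function from the earlier snippet: one pass reads the sequence left to right, a second reads it right to left, and the two hidden states are concatenated at each step. The function and parameter names are hypothetical, not from this chapter.

```python
def bilstm(sequence, fwd_params, bwd_params, h_dim):
    # Run one LSTM forward and another backward over the same sequence.
    h_f, c_f = np.zeros(h_dim), np.zeros(h_dim)
    h_b, c_b = np.zeros(h_dim), np.zeros(h_dim)
    fwd_out, bwd_out = [], []
    for x in sequence:                        # forward pass (past -> future)
        h_f, c_f = lstm_step(x, h_f, c_f, fwd_params)
        fwd_out.append(h_f)
    for x in sequence[::-1]:                  # backward pass (future -> past)
        h_b, c_b = lstm_step(x, h_b, c_b, bwd_params)
        bwd_out.append(h_b)
    bwd_out.reverse()                         # align backward outputs with time order
    # Each time step now sees context from both the past and the future.
    return [np.concatenate([f, b]) for f, b in zip(fwd_out, bwd_out)]
```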
