Transformers
Learn about transformers and the architectures that combine to form them.
Attention
The LSTM-based architecture we used to prepare our first language model for text generation has one major limitation. The RNN layer (generally speaking, it could be an LSTM, GRU, etc.) takes a context window of a defined size as input and encodes all of it into a single vector. This bottleneck vector needs to capture a lot of information before the decoding stage can use it to start generating the next token.
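To make the bottleneck concrete, here is a minimal sketch assuming a PyTorch-style encoder; the sizes and names (`vocab_size`, `hidden_dim`, and so on) are illustrative, not the lesson's actual model. The encoder produces a hidden state at every time step, but only the final state is handed to the decoding stage.

```python
import torch
import torch.nn as nn

# Illustrative sizes, not the lesson's actual configuration.
vocab_size, embed_dim, hidden_dim, context_len = 1000, 64, 128, 10

embedding = nn.Embedding(vocab_size, embed_dim)
encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, context_len))   # one context window
all_states, (h_n, c_n) = encoder(embedding(tokens))

print(all_states.shape)  # (1, 10, 128): one hidden state per input token
print(h_n.shape)         # (1, 1, 128): the single vector passed on to decoding
```

Everything the decoder learns about the context window has to fit into that single final state, which is exactly the bottleneck that attention is designed to relieve.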
Attention is one of the most powerful concepts in the deep learning space, and it has really changed the game. The core idea behind the attention mechanism is to use all the interim hidden states of the RNN, rather than just the final one, and decide which ones to focus on before the decoding stage uses them.
A more formal way of presenting attention is: given a set of value vectors (all the hidden states of the RNN) and a query vector (this could be the decoder state), attention is a technique to compute a weighted sum of the values, dependent on the query.
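Written out in symbols (our own notation, not necessarily the course's): if h_1, ..., h_n are the value vectors and q is the query, attention first turns compatibility scores into weights with a softmax and then forms the weighted sum.

```latex
\alpha_i = \frac{\exp\big(\operatorname{score}(q, h_i)\big)}
                {\sum_{j=1}^{n} \exp\big(\operatorname{score}(q, h_j)\big)},
\qquad
c = \sum_{i=1}^{n} \alpha_i \, h_i
```

A common choice for the score is the simple dot product of q and h_i, though additive and bilinear scoring functions also appear in the literature.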
The weighted sum acts as a selective summary of the information contained in the hidden states (the value vectors), and the query decides which values to focus on. The roots of the attention mechanism lie in research on Neural Machine Translation (NMT) architectures. NMT models particularly struggled with alignment issues, and this is where attention helped greatly. For instance, the translation of a sentence from English to French may not align word-for-word. Attention is not limited to NMT use cases and is widely used across other NLP tasks, such as text generation and classification.
The idea is pretty straightforward, but how do we implement and use it? The figure below depicts a sample scenario of how an attention mechanism works; it demonstrates an unrolled RNN at a given time step.
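To complement the figure, the snippet below sketches the same computation with dot-product scoring. The tensors are random stand-ins and the shapes are assumptions rather than the lesson's code; the point is only to show the score, softmax, and weighted-sum steps.

```python
import torch
import torch.nn.functional as F

# Stand-ins for the real model: 10 encoder hidden states and one decoder query,
# each of size 128 (both are random placeholders here).
hidden_states = torch.randn(1, 10, 128)   # values: one vector per input token
query = torch.randn(1, 128)               # query: e.g., the current decoder state

# 1. Score each hidden state against the query (dot-product scoring).
scores = torch.matmul(hidden_states, query.unsqueeze(-1)).squeeze(-1)  # (1, 10)

# 2. Normalize the scores into attention weights that sum to 1.
weights = F.softmax(scores, dim=-1)                                    # (1, 10)

# 3. Compute the context vector as the weighted sum of the hidden states.
context = torch.bmm(weights.unsqueeze(1), hidden_states).squeeze(1)    # (1, 128)

print(weights.sum(dim=-1))  # tensor([1.]) -- the weights form a distribution
```

At each decoding step, the decoder would redo this computation with its current state as the query, so different input positions are emphasized for different output tokens.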