Attention was born in order to address the limitations of Seq2Seq models.

The core idea is that the context vector $z$ should have access to all parts of the input sequence instead of just the last one.

In other words, we need to form a direct connection with each timestep.

This idea was originally proposed for computer vision. It was initially conceptualized like this: by looking at different parts of the image (glimpses), we can learn to accumulate information about a shape and classify the image accordingly.

The same principle was later extended to sequences. We can look at all the different words at the same time and learn to “pay attention” to the correct ones depending on the task at hand.

This is what we now call attention. Attention is simply a notion of memory gained from attending to multiple inputs through time.

Let’s see it in action.

Attention in the encoder-decoder example

In the encoder-decoder RNN case, given the previous decoder state $y_{i-1}$ and the encoder hidden states $h = [h_1, h_2, \ldots, h_n]$, we have something like this:

$$e_i = \text{attention}(y_{i-1}, h) \in \mathbb{R}^n$$

The index $i$ indicates the prediction step. Essentially, we define a score (weighting) between the hidden state of the decoder and all the hidden states of the encoder.

More specifically, for each hidden state $h_j$ (indexed by $j$) in $h_1, h_2, \ldots, h_n$, we will calculate a scalar:

$$e_{ij} = \text{attention}_{\text{net}}(y_{i-1}, h_j)$$
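To make the scoring step concrete, here is a minimal sketch in plain Python. The text does not specify what the attention network looks like, so this assumes a small one-hidden-layer feed-forward scorer of the form v · tanh(W_y y + W_h h_j). The parameter names `W_y`, `W_h`, `v` and all dimensions are hypothetical, and the weights are random rather than learned.

```python
import math
import random


def attention_net(y_prev, h_j, W_y, W_h, v):
    """Score one encoder state h_j against the previous decoder state y_prev.

    Assumed form (not given in the text): v . tanh(W_y y_prev + W_h h_j).
    """
    hidden = [
        math.tanh(
            sum(w * y for w, y in zip(row_y, y_prev))
            + sum(w * x for w, x in zip(row_h, h_j))
        )
        for row_y, row_h in zip(W_y, W_h)
    ]
    return sum(v_k * a_k for v_k, a_k in zip(v, hidden))


def attention_scores(y_prev, h, W_y, W_h, v):
    """Return e_i: one scalar score e_ij per encoder hidden state h_j."""
    return [attention_net(y_prev, h_j, W_y, W_h, v) for h_j in h]


def random_matrix(rng, rows, cols):
    """Helper for this sketch only: a rows x cols matrix of Gaussian values."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d_enc, d_dec, d_att = 5, 8, 6, 4  # hypothetical sizes

h = random_matrix(rng, n, d_enc)                     # encoder states h_1..h_n
y_prev = [rng.gauss(0.0, 1.0) for _ in range(d_dec)]  # decoder state y_{i-1}

# Randomly initialised parameters; in a real model these would be trained.
W_y = random_matrix(rng, d_att, d_dec)
W_h = random_matrix(rng, d_att, d_enc)
v = [rng.gauss(0.0, 1.0) for _ in range(d_att)]

e_i = attention_scores(y_prev, h, W_y, W_h, v)
print(len(e_i))  # 5
```

The result `e_i` has one entry per encoder timestep, which matches the statement that the score vector lives in $\mathbb{R}^n$. Other scoring functions, such as a plain dot product, would fit the same interface.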

Visually, in our example, we have something like this:
