Attention
Get to know Attention, one of the most important ideas in deep learning.
Attention was born in order to address the limitations of Seq2Seq models.
The core idea is that the context vector should have access to all parts of the input sequence instead of just the last one.
In other words, we need to form a direct connection with each timestep.
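To make that contrast concrete, here is a minimal NumPy sketch. The weights below are hand-picked purely for illustration; how they are actually computed is covered later in this lesson.

```python
import numpy as np

# Toy encoder outputs: one hidden state per input timestep (seq_len=4, hidden_dim=3).
encoder_states = np.random.randn(4, 3)

# Plain Seq2Seq: the context is only the last hidden state.
context_last_only = encoder_states[-1]

# With attention: the context is a weighted sum over *all* timesteps,
# so every input position has a direct connection to the decoder.
weights = np.array([0.1, 0.2, 0.3, 0.4])     # illustrative weights that sum to 1
context_attended = weights @ encoder_states  # shape: (3,)
```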
This idea was originally proposed for computer vision. It was initially conceptualized like this: by looking at different parts of the image (glimpses), we can learn to accumulate information about a shape and classify the image accordingly.
The same principle was later extended to sequences. We can look at all the different words at the same time and learn to “pay attention” to the correct ones depending on the task at hand.
This is what we now call attention. Attention is simply a notion of memory gained from attending to multiple inputs through time.
Let’s see it in action.
Attention in the encoder-decoder example
In the encoder-decoder RNN case, given the previous decoder state $y_{i-1}$ and the encoder hidden states $h = \{h_1, h_2, \dots, h_n\}$, we have something like this:

$$y_i = f(y_{i-1}, h)$$
The index $i$ indicates the prediction step. Essentially, we define a score (weighting) between the hidden state of the decoder and all the hidden states of the encoder.

More specifically, for each hidden state (denoted by $h_j$), we will calculate a scalar:

$$e_{ij} = \mathrm{attention}(y_{i-1}, h_j)$$
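The exact form of the attention function varies. As one illustration, here is a minimal NumPy sketch of an additive (Bahdanau-style) score; the parameter names `W_y`, `W_h`, and `v` are made up for this example and stand in for learned weights.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    return np.exp(x) / np.exp(x).sum()

def attention_scores(y_prev, encoder_states, W_y, W_h, v):
    """Additive scoring: e_ij = v^T tanh(W_y y_{i-1} + W_h h_j), one scalar per h_j."""
    return np.array([
        v @ np.tanh(W_y @ y_prev + W_h @ h_j)
        for h_j in encoder_states
    ])

# Toy dimensions: decoder state dim 3, encoder state dim 3, attention dim 4.
rng = np.random.default_rng(0)
y_prev = rng.standard_normal(3)                # previous decoder state y_{i-1}
encoder_states = rng.standard_normal((5, 3))   # encoder hidden states h_1 ... h_n
W_y = rng.standard_normal((4, 3))
W_h = rng.standard_normal((4, 3))
v = rng.standard_normal(4)

e = attention_scores(y_prev, encoder_states, W_y, W_h, v)  # scalars e_i1 ... e_in
a = softmax(e)                                 # normalized attention weights
context = a @ encoder_states                   # weighted sum of encoder states
```

The scalars $e_{ij}$ are then normalized (here with a softmax) so they can act as weights over the encoder states.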
Visually, in our example, we have something like this: