...


Masked Multi-Head Attention

Learn about the masked multi-head attention mechanism and how it works.

In our English-to-French translation task, say our training dataset looks like the one shown here:

A sample training set

Source sentence         Target sentence
I am good               Je vais bien
Good morning            Bonjour
Thank you very much     Merci beaucoup

By looking at the preceding dataset, we can see that we have source and target sentences. We have already seen how the decoder predicts the target sentence word by word at each time step; note that this happens only during testing.

During training, since we already have the correct target sentence, we can simply feed the whole target sentence as input to the decoder, but with a small modification. We learned that the decoder takes the <sos> token as its first input and appends the predicted word to the input at every time step, predicting the target sentence until the <eos> token is reached. So, we can just add the <sos> token to the beginning of our target sentence and send that as an input to the decoder.

Say we are translating the English sentence 'I am good' into the French sentence 'Je vais bien'. We add the <sos> token to the beginning of the target sentence and send <sos> Je vais bien as the input to the decoder, and the decoder then predicts Je vais bien <eos> as the output, as shown in the following figure:

Encoder and decoder of the transformer
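To make this shifting concrete, here is a minimal Python sketch (with an illustrative three-word target sentence) of how the decoder input and the expected output are derived from the same target sentence during training; this scheme is commonly called teacher forcing:

```python
# A minimal sketch of teacher forcing: during training, the decoder input is
# the target sentence with <sos> prepended, and the expected output is the
# same sentence shifted by one position, with <eos> appended.
# The token lists here are purely illustrative.

target_tokens = ["Je", "vais", "bien"]

decoder_input   = ["<sos>"] + target_tokens   # <sos> Je vais bien
expected_output = target_tokens + ["<eos>"]   # Je vais bien <eos>

print(decoder_input)     # ['<sos>', 'Je', 'vais', 'bien']
print(expected_output)   # ['Je', 'vais', 'bien', '<eos>']
```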

But how does this work? Isn't this kind of ambiguous? Why do we need to feed the entire target sentence and let the decoder predict the shifted target sentence as output? Let's explore this in more detail.

We learned that instead of feeding the input directly to the decoder, we convert it into an embedding (the output embedding matrix), add positional encoding, and then feed it to the decoder. Suppose the following matrix, X, is obtained as a result of adding the output embedding matrix and the positional encoding:

Input matrix
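As a rough illustration of how such a matrix can arise, here is a minimal sketch, assuming a toy sequence of four decoder input tokens, an embedding dimension of 4, randomly initialized embeddings, and the standard sinusoidal positional encoding; the actual values in the lesson's matrix will of course differ:

```python
import numpy as np

# Toy setup: 4 decoder input tokens, embedding dimension d_model = 4.
# The embedding values here are random placeholders; in a real model
# the output embedding matrix is learned.
tokens = ["<sos>", "Je", "vais", "bien"]
d_model = 4
rng = np.random.default_rng(0)
output_embedding = rng.normal(size=(len(tokens), d_model))

# Sinusoidal positional encoding: one row per position, sine on even
# dimensions and cosine on odd dimensions.
positions = np.arange(len(tokens))[:, None]          # shape (seq_len, 1)
dims = np.arange(d_model)[None, :]                   # shape (1, d_model)
angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_model)
pos_encoding = np.where(dims % 2 == 0,
                        np.sin(positions * angle_rates),
                        np.cos(positions * angle_rates))

# The decoder input matrix X is the element-wise sum of the two.
X = output_embedding + pos_encoding
print(X.shape)  # (4, 4): one row per token, one column per embedding dimension
```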

Now, we feed the preceding matrix, X, to the decoder. The first layer in the decoder is masked multi-head attention. This works similarly to the multi-head attention mechanism we learned about with the encoder, but with a small difference.
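That difference is the mask. As a minimal sketch (assuming a sequence length of 4 and placeholder attention scores of zero), a look-ahead mask sets the scores for future positions to negative infinity before the softmax, so each position can attend only to itself and the positions before it:

```python
import numpy as np

seq_len = 4  # number of decoder input tokens, e.g. <sos> Je vais bien

# Look-ahead mask: entries above the diagonal are -inf, everything else is 0,
# so adding the mask to the attention scores hides future positions.
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

scores = np.zeros((seq_len, seq_len))   # placeholder attention scores
masked_scores = scores + mask

# Row-wise softmax: exp(-inf) = 0, so each row distributes its weight
# only over the current and earlier positions.
weights = np.exp(masked_scores)
weights /= weights.sum(axis=1, keepdims=True)
print(np.round(weights, 2))   # lower-triangular attention weight matrix
```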

Computing query, key, and value matrices

To perform self-attention, we create three new matrices, called the query matrix Q, the key matrix K ...