Multihead Attention

Dive into the new concept of multihead attention, which allows transformers to capture diverse features and enhance interpretability.

Now, let's discuss a variant of self-attention, and of the attention mechanism in general, known as multihead attention. This concept is vital for encoding multiple features with the transformer model.

We've seen how queries (Q), keys (K), and values (V) are projections of the same thing, essentially representing different views or features of the input, especially in the context of natural language processing (NLP). The query represents the information the model is looking for within the input sequence, keys help establish relationships with other elements in the sequence, providing context and connections, and values hold the actual information content related to a particular element in the sequence. We'll explore this further.
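To make the idea of "three views of the same input" concrete, here is a minimal NumPy sketch (the matrix sizes and random weights are illustrative assumptions, not values from the lesson). The same input sequence `x` is multiplied by three different learned weight matrices to produce Q, K, and V:

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model = 4, 8                   # toy sizes: 4 tokens, 8-dim embeddings
x = rng.normal(size=(seq_len, d_model))   # one input sequence

# Three learned projection matrices give three views of the same input.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = x @ W_q  # queries: what each token is looking for
K = x @ W_k  # keys: what each token offers for matching
V = x @ W_v  # values: the content each token carries

print(Q.shape, K.shape, V.shape)
```

In a real transformer these weight matrices are trained parameters; here they are random purely to show that Q, K, and V are nothing more than linear projections of one shared input.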

Understanding multihead attention

For example, in NLP, we can examine the part-of-speech tag of a word and query its relationship with other part-of-speech tags in the same sentence. This is useful for understanding connections between named entities or resolving references.

Consider the example sentence, "The student didn't attempt the quiz because it was too difficult." In this sentence, a single word such as "it" can refer to different words depending on the sentence's structure. Each type of attribute has its own unique projection.

The need for multiple features

What if we want to include more than one feature at once? We're not limited to a single projection. We aim to include various perspectives, similar to the concept of channels in convolutional neural networks. This is where multihead attention comes in.
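The channel analogy can be sketched in a few lines of NumPy: each head gets its own Q/K/V projections (its own "view"), runs scaled dot-product attention independently, and the head outputs are concatenated back to the model dimension. The function name, head count, and random weights below are illustrative assumptions, not part of the lesson:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention(x, num_heads, rng):
    """Toy multihead self-attention: one Q/K/V projection per head."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    head_outputs = []
    for _ in range(num_heads):
        # Separate (here random, normally learned) projections per head,
        # so each head attends to a different view of the input.
        W_q = rng.normal(size=(d_model, d_head))
        W_k = rng.normal(size=(d_model, d_head))
        W_v = rng.normal(size=(d_model, d_head))
        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        scores = Q @ K.T / np.sqrt(d_head)   # scaled dot-product scores
        weights = softmax(scores, axis=-1)   # each row sums to 1
        head_outputs.append(weights @ V)
    # Concatenate the heads back to the model dimension,
    # just as channels are stacked in a convolutional layer.
    return np.concatenate(head_outputs, axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))   # 5 tokens, 16-dim embeddings
out = multihead_attention(x, num_heads=4, rng=rng)
print(out.shape)  # same shape as the input sequence
```

Because the heads are independent, one head is free to track part-of-speech relationships while another resolves pronoun references like the "it" in the earlier example sentence.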
