Multi-Head Attention Mechanism

Learn about the need for multiple attention matrices and how to compute them.

Instead of using a single attention head, we can use multiple attention heads. We have already learned how to compute the attention matrix, Z. Rather than computing just one attention matrix, Z, we can compute several attention matrices in parallel. But what is the use of computing multiple attention matrices?
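As a rough sketch of this idea, the snippet below computes several attention matrices in parallel with NumPy, each head using its own (here, randomly initialized) query, key, and value projections, and then concatenates the per-head results. All weight values, dimensions, and variable names are illustrative assumptions, not the exact setup used later in this lesson:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: Z = softmax(QK^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 4, 8, 2   # illustrative sizes
d_k = d_model // num_heads
X = rng.normal(size=(seq_len, d_model))  # input embeddings (random stand-in)

# Each head gets its own projection weights, so each head can
# learn to attend to different aspects of the input.
heads = []
for _ in range(num_heads):
    W_Q = rng.normal(size=(d_model, d_k))
    W_K = rng.normal(size=(d_model, d_k))
    W_V = rng.normal(size=(d_model, d_k))
    Z_i = attention(X @ W_Q, X @ W_K, X @ W_V)  # this head's attention output
    heads.append(Z_i)

# Concatenate the per-head outputs and project back to d_model
W_O = rng.normal(size=(num_heads * d_k, d_model))
multi_head_Z = np.concatenate(heads, axis=-1) @ W_O
print(multi_head_Z.shape)  # (4, 8): same shape as a single-head output
```

Note that concatenating the heads and applying the final projection `W_O` restores the original embedding dimension, so multi-head attention is a drop-in replacement for a single head.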

Let's understand this with an example. Consider the phrase:
