Multi-Head Self-Attention
Explore how multi-head attention expands upon self-attention.
The idea of self-attention can be expanded to multi-head attention. In essence, we run through the attention mechanism several times.
Each time, we map the independent set of Key, Query, Value matrices into different lower-dimensional spaces and compute the attention there. The individual output is called a “head”. The mapping is achieved by multiplying each matrix with a separate weight matrix, denoted as $W_i^K, W_i^Q, W_i^V \in \mathbb{R}^{d_{model} \times d_k}$, where $i$ is the head index.
To compensate for the extra complexity, the output vector size is divided by the number of heads. Specifically, the vanilla transformer uses $d_{model} = 512$ and $h = 8$ heads, which gives vector representations of size $d_k = d_{model} / h = 64$.
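The arithmetic behind this trade-off can be checked with a few lines of Python. This is only an illustrative sketch: the variable names are made up for the example, and it counts the weights of one projection type (for instance, the Query projection) while ignoring biases.

```python
# Head size in the vanilla transformer: the model dimension is split evenly across heads.
d_model = 512          # model (embedding) dimension
h = 8                  # number of attention heads
d_k = d_model // h     # per-head dimension

# Weights in one full-size projection (d_model x d_model) ...
single_head_params = d_model * d_model
# ... versus h smaller projections of shape (d_model x d_k).
multi_head_params = h * d_model * d_k

print(d_k)                                      # 64
print(single_head_params == multi_head_params)  # True
```

Because each head works in a space that is $h$ times smaller, the $h$ heads together use the same number of projection weights as one full-size head.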
With multi-head attention, the model has multiple independent paths (ways) to understand the input.
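The heads are then concatenated and transformed using a square weight matrix $W^O \in \mathbb{R}^{d_{model} \times d_{model}}$, since the $h$ concatenated heads of size $d_k$ together have dimension $h \cdot d_k = d_{model}$.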
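Putting it all together, we get:
$$\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \dots, \text{head}_h) \, W^O$$
where
$$\text{head}_i = \text{Attention}(Q W_i^Q, K W_i^K, V W_i^V)$$
where again:
$$W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d_{model} \times d_k}, \quad W^O \in \mathbb{R}^{h d_k \times d_{model}}$$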
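The steps described so far (project per head, attend, concatenate, apply the output projection) can be sketched in NumPy. This is a minimal illustration, not a reference implementation. It assumes the standard scaled dot-product form of attention, softmax(QKᵀ / √d_k) V, takes Q, K and V all from the same input `x` (self-attention), and uses random weights. The function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (assumed form): softmax(q k^T / sqrt(d_k)) v
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o):
    # x: (seq_len, d_model); w_q, w_k, w_v: (h, d_model, d_k); w_o: (h * d_k, d_model)
    h = w_q.shape[0]
    # One head per index i, each computed in its own lower-dimensional space.
    heads = [attention(x @ w_q[i], x @ w_k[i], x @ w_v[i]) for i in range(h)]
    # Concatenate along the feature axis, then mix the heads with the output projection.
    return np.concatenate(heads, axis=-1) @ w_o

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 512, 8
d_k = d_model // h

x = rng.standard_normal((seq_len, d_model))
# Random weights, scaled down only to keep the values in a reasonable range.
w_q, w_k, w_v = (rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model) for _ in range(3))
w_o = rng.standard_normal((h * d_k, d_model)) / np.sqrt(d_model)

out = multi_head_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)  # (5, 512)
```

The output has the same shape as the input, so the block can be stacked or followed by further layers without changing dimensions.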
Since the heads are independent of each other, we can perform the self-attention computation for each head in parallel on different workers.
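One common way to exploit this independence is to treat the head index as a batch dimension, so that all heads are computed by a single batched matrix multiplication instead of by literal separate workers. The sketch below compares a per-head loop with such a batched version. It assumes scaled dot-product attention and random weights, and the function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def heads_loop(x, w_q, w_k, w_v):
    # Sequential reference: compute each head one after another.
    out = []
    for i in range(w_q.shape[0]):
        q, k, v = x @ w_q[i], x @ w_k[i], x @ w_v[i]
        scores = q @ k.T / np.sqrt(q.shape[-1])
        out.append(softmax(scores) @ v)
    return np.stack(out)  # (h, seq_len, d_k)

def heads_batched(x, w_q, w_k, w_v):
    # matmul broadcasts x over the leading head axis:
    # (seq_len, d_model) @ (h, d_model, d_k) -> (h, seq_len, d_k)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Batched q k^T per head: (h, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v  # (h, seq_len, d_k)

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 512, 8
d_k = d_model // h
x = rng.standard_normal((seq_len, d_model))
w_q, w_k, w_v = (rng.standard_normal((h, d_model, d_k)) / np.sqrt(d_model) for _ in range(3))

loop_out = heads_loop(x, w_q, w_k, w_v)
batched_out = heads_batched(x, w_q, w_k, w_v)
same = np.allclose(loop_out, batched_out)
print(loop_out.shape, same)
```

Both versions should produce the same heads up to floating-point rounding. The batched form simply lets the underlying linear algebra library decide how to spread the work across cores or accelerator threads.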