Multi-Head Self-Attention
Explore how multi-head attention expands upon self-attention.
The idea of self-attention can be extended to multi-head attention. In essence, we run the attention mechanism several times in parallel.
Each time, we map an independent set of Key, Query, and Value matrices into a different lower-dimensional space and compute the attention there. Each individual output is called a “head”. The mapping is achieved by multiplying each matrix with a separate weight matrix, denoted W_i^Q, W_i^K, and W_i^V for the i-th head. The resulting heads are concatenated and projected back to the model dimension with an output weight matrix W^O.
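To make this concrete, here is a minimal NumPy sketch of multi-head self-attention. It is illustrative only: the names (d_model, num_heads, d_k) and the random initialization are assumptions for the example, not part of any specific library API, and a real implementation would learn the weight matrices during training.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, num_heads, rng=np.random.default_rng(0)):
    """x: (seq_len, d_model). Returns an array of shape (seq_len, d_model)."""
    seq_len, d_model = x.shape
    assert d_model % num_heads == 0
    d_k = d_model // num_heads  # lower-dimensional space per head

    heads = []
    for _ in range(num_heads):
        # Each head gets its own projection matrices W_q, W_k, W_v that map
        # the input into a d_k-dimensional subspace.
        W_q = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_k = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)
        W_v = rng.standard_normal((d_model, d_k)) / np.sqrt(d_model)

        Q, K, V = x @ W_q, x @ W_k, x @ W_v
        # Scaled dot-product attention computed inside the subspace.
        scores = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(scores @ V)  # one "head": (seq_len, d_k)

    # Concatenate all heads and project back to d_model with W_o.
    W_o = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
    return np.concatenate(heads, axis=-1) @ W_o

# Example: a sequence of 5 tokens with model dimension 16, split over 4 heads.
out = multi_head_self_attention(np.ones((5, 16)), num_heads=4)
print(out.shape)  # (5, 16)
```

Because each head attends in its own subspace, different heads can focus on different relationships between tokens, and the final projection merges them back into a single representation of size d_model.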