Multi-Head Self-Attention

Explore how multi-head attention expands upon self-attention.


The idea of self-attention can be expanded to multi-head attention. In essence, we run through the attention mechanism several times.

Each time, we map an independent set of Key, Query, and Value matrices into different lower-dimensional spaces and compute the attention there. Each individual output is called a “head”. The mapping is achieved by multiplying each matrix with a separate weight matrix, denoted as $W_i^K, W_i^Q \in R^{d_{model} \times d_k}$ and $W_i^V \in R^{d_{model} \times d_k}$ ...
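To make the per-head projections concrete, here is a minimal NumPy sketch of multi-head self-attention. It assumes toy dimensions ($d_{model}=8$, $h=2$ heads, $d_k = d_{model}/h$) and randomly initialized weight matrices purely for illustration; the function names and shapes are our own choices, not a reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) similarity scores
    weights = softmax(scores, axis=-1)   # attention weights per query position
    return weights @ V                   # (seq_len, d_k)

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    # X: (seq_len, d_model); W_Q, W_K, W_V: one (d_model, d_k) matrix per head
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X @ Wq, X @ Wk, X @ Wv            # project into a lower-dimensional space
        heads.append(scaled_dot_product_attention(Q, K, V))
    concat = np.concatenate(heads, axis=-1)          # (seq_len, h * d_k)
    return concat @ W_O                              # project back to d_model

# Example usage with toy dimensions
rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head_self_attention(X, W_Q, W_K, W_V, W_O).shape)  # (4, 8)
```

Each head attends over the full sequence but in its own lower-dimensional subspace; the head outputs are then concatenated and projected back to $d_{model}$ with an output weight matrix (here called `W_O`).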