Multi-Head Attention

Visualizing how inputs become head projections.

The Functional Formula

The operation is defined as $\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, \dots, head_h)\,W^O$, where $head_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V)$. In Self-Attention, we plug the same embedding matrix $X$ into all three inputs, so $Q = K = V = X$.
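As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product attention with the self-attention wiring, where the same $X$ serves as $Q$, $K$, and $V$. The function name and toy dimensions are our own, not part of the original:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (seq, seq) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V

# Self-attention: the same embedding matrix X feeds all three inputs.
X = np.random.default_rng(0).standard_normal((4, 8))  # 4 tokens, d_model = 8
out = scaled_dot_product_attention(X, X, X)
print(out.shape)  # (4, 8)
```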

Projection Matrices ($W_i$)

To compute $head_i$, we multiply the inputs by learned parameter matrices $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, and $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$, which project the $d_{model}$-dimensional embeddings down to the smaller per-head dimensions (typically $d_k = d_v = d_{model}/h$ for $h$ heads). These are the Linear Layers in the diagram.
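A sketch of one head's projection step, assuming the standard per-head width $d_k = d_{model}/h$; the random matrices here stand in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 8, 2
d_k = d_model // h                        # per-head width: d_model / h

X = rng.standard_normal((4, d_model))     # 4 token embeddings

# One learned matrix per role per head; random stand-ins here.
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

# The "Linear Layer" step: project X into head_i's query/key/value spaces.
Q_i, K_i, V_i = X @ W_Q, X @ W_K, X @ W_V
print(Q_i.shape, K_i.shape, V_i.shape)    # (4, 4) each: (seq, d_k)
```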

Final Aggregation ($W^O$)

After all heads calculate attention in parallel, their outputs are concatenated and multiplied by the output projection $W^O \in \mathbb{R}^{hd_v \times d_{model}}$, producing a final result with the same shape as the input embeddings.
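Putting the pieces together, a minimal end-to-end sketch under the same assumptions as above (random weights in place of learned ones; a real implementation would run the heads as one batched matrix multiply rather than a loop):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
seq, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.standard_normal((seq, d_model))

heads = []
for _ in range(h):  # each head has its own projections (parallel in practice)
    W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_Q, X @ W_K, X @ W_V))

# Final aggregation: Concat(head_1, ..., head_h) W^O.
W_O = rng.standard_normal((h * d_k, d_model))
out = np.concatenate(heads, axis=-1) @ W_O
print(out.shape)  # (4, 8): back to (seq, d_model)
```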
