Visualizing how inputs become head projections.
The formula $\mathrm{MultiHead}(Q, K, V)$ defines the overall computation. In Self-Attention, we plug our embeddings $X$ into all three inputs, so $Q = K = V = X$.
To compute $\mathrm{head}_i$, we multiply the inputs by learned parameter matrices $W_i^Q$, $W_i^K$, and $W_i^V$. These are the Linear Layers in the diagram.
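Written out (following the standard Transformer formulation, where $h$ is the number of heads and $d_k$ the per-head key dimension, both introduced here for notation), the pieces are:

$$
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V) \\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
\end{aligned}
$$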
After all heads calculate attention in parallel, their outputs are concatenated and multiplied by $W^O$ to produce the final output.
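To see the three steps together, here is a minimal NumPy sketch, assuming toy dimensions ($d_{model} = 8$, $h = 2$) and hypothetical names such as `multi_head_self_attention`; it illustrates the logic above rather than a production implementation, and the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_self_attention(X, W_Q, W_K, W_V, W_O):
    # Self-Attention: the same embeddings X feed Q, K, and V.
    heads = []
    for W_q, W_k, W_v in zip(W_Q, W_K, W_V):
        # Per-head projections (the Linear Layers in the diagram).
        heads.append(attention(X @ W_q, X @ W_k, X @ W_v))
    # Concatenate the head outputs and apply the output projection W^O.
    return np.concatenate(heads, axis=-1) @ W_O

# Toy example: 4 tokens, d_model = 8, h = 2 heads, d_k = d_v = 4.
rng = np.random.default_rng(0)
seq_len, d_model, h = 4, 8, 2
d_k = d_model // h
X = rng.normal(size=(seq_len, d_model))
# Randomly initialized stand-ins for the learned parameter matrices.
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))

out = multi_head_self_attention(X, W_Q, W_K, W_V, W_O)
print(out.shape)  # (4, 8): one d_model-dimensional output per token
```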