This visualizer follows the slide convention used in class: a sequence of 3 tokens, each a 512-dimensional vector (shape 3 × 512), is projected into Q, K, and V. The projections are split across multiple heads, each head runs attention independently, and the head outputs are concatenated back to the model dimension. The right-hand math panel uses a tiny toy example so you can see the score calculation without drowning in 512 numbers.
The input stays as a 3 × 512 matrix. Multi-head attention does not split the input first; it creates learned projections and then splits the projected space into heads.
Each projection is a learned matrix multiplication. The model learns Wq, Wk, and Wv so that some directions become better for matching (Q/K) and some become better for carrying information (V).
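As a rough illustration of the projection step, the NumPy sketch below multiplies a 3 × 512 input by three weight matrices. The names `X`, `W_q`, `W_k`, and `W_v` are assumptions made for this example, and the weights are random stand-ins rather than learned values; the 512 × 512 weight shape is also an assumption chosen so the projections keep the model dimension.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

seq_len, d_model = 3, 512
X = rng.standard_normal((seq_len, d_model))  # input: 3 tokens x 512 features

# Random stand-ins for the learned matrices Wq, Wk, Wv (assumed 512 x 512).
# Dividing by sqrt(d_model) only keeps the projected values at a modest scale.
W_q = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_k = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

# Each projection is one matrix multiplication over the whole input.
Q = X @ W_q  # queries, used for matching
K = X @ W_k  # keys, used for matching
V = X @ W_v  # values, the information that gets mixed

print(Q.shape, K.shape, V.shape)
```

Note that the input `X` is never sliced here: all three projections see the full 512-dimensional vectors, and any splitting into heads happens afterwards on `Q`, `K`, and `V`.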
In the 4-head slide example, the 512 projected dimensions are split evenly across heads, giving 512 / 4 = 128 dimensions per head. Each head can learn to focus on different relationships, and the head outputs are then concatenated back to 512 dimensions.
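The split and the concatenation can be sketched as a pair of reshapes. This is a minimal NumPy example under the 4-head, 512-dimension setup described above; `split_heads` and `merge_heads` are hypothetical helper names, and `Q` is a random stand-in for an already projected matrix.

```python
import numpy as np

seq_len, d_model, num_heads = 3, 512, 4

rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_model))  # stand-in for a projected matrix


def split_heads(M, num_heads):
    """Reshape (seq_len, d_model) into (num_heads, seq_len, d_head)."""
    seq_len, d_model = M.shape
    d_head = d_model // num_heads
    return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)


def merge_heads(H):
    """Concatenate (num_heads, seq_len, d_head) back to (seq_len, d_model)."""
    num_heads, seq_len, d_head = H.shape
    return H.transpose(1, 0, 2).reshape(seq_len, num_heads * d_head)


Q_heads = split_heads(Q, num_heads)  # one 3 x 128 slice per head
Q_back = merge_heads(Q_heads)        # back to 3 x 512

print(Q_heads.shape, Q_back.shape)
```

With this layout, head 1 holds columns 128 through 255 of the projected matrix, and merging the heads without any processing in between returns the original matrix unchanged.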
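Multi-head attention gives the model several smaller attention spaces instead of one large space. Because each head computes its own attention weights, different heads can attend to different tokens for the same query, which makes it easier to capture several patterns at once.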
This panel compresses the idea into a toy 4-dimensional example so the score path is visible: QKᵀ → scaled scores → softmax → weighted sum of V.
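The same score path can be traced in a few lines of NumPy. The sketch below uses 3 tokens with 4 dimensions each; the numbers in `Q`, `K`, and `V` are made up for this illustration and are not the values shown in the panel. The function names are assumptions as well.

```python
import numpy as np


def softmax(x, axis=-1):
    """Softmax along one axis, shifted by the max for numerical stability."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T                    # QK^T: one score per (query, key) pair
    scaled = scores / np.sqrt(d_k)      # scale by sqrt of the key dimension
    weights = softmax(scaled, axis=-1)  # each row sums to 1
    output = weights @ V                # weighted sum of the value rows
    return output, weights


# Illustrative toy values: 3 tokens, 4 dimensions.
Q = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0, 0.0]])
K = Q.copy()  # keys equal to queries, so each token matches itself best
V = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, 7.0, 8.0],
              [9.0, 10.0, 11.0, 12.0]])

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(3))
print(output.round(3))
```

Each row of `weights` says how much one token draws from every token's value vector, so each row of `output` is a blend of the rows of `V`.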