DeepSeek V3.2 Multi-Head Latent Attention

Projection Pipeline


Legend: MLA components; New in V3.2; Modified in V3.2

Input Tokens: full model space, \(h_t \in \mathbb{R}^{d_{model}}\), model width 2048

Down-projection (MLA): 2048 wide -> 512 latent

Indexer (new in V3.2): scores tokens for sparse routing

Full Q / K / V Per Head (MHA): standard tensors, no latent compression; Q 2048, K 2048, V 2048

Compressed Q / KV Latent Space: narrow vectors, cached or expanded on demand; \(c^Q\) 512, \(c^{KV}\) 512, \(k^R\) 64
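The widths above can be checked with a small NumPy sketch of the down-projection step. Only the sizes (2048, 512, 512, 64) come from the diagram; the weight names `W_dq`, `W_dkv`, `W_kr` are hypothetical, the weights are random stand-ins for trained parameters, and the RoPE rotation itself is left out.

```python
import numpy as np

# Sizes taken from the diagram labels.
D_MODEL = 2048  # model width
D_Q = 512       # compressed query latent (cQ)
D_KV = 512      # compressed key/value latent (cKV)
D_ROPE = 64     # RoPE key slice (kR)

rng = np.random.default_rng(0)

# Hypothetical weight names; random values stand in for trained parameters.
W_dq = rng.standard_normal((D_MODEL, D_Q)) / np.sqrt(D_MODEL)
W_dkv = rng.standard_normal((D_MODEL, D_KV)) / np.sqrt(D_MODEL)
W_kr = rng.standard_normal((D_MODEL, D_ROPE)) / np.sqrt(D_MODEL)


def down_project(h_t):
    """Map one hidden state to the three narrow vectors in the diagram."""
    c_q = h_t @ W_dq    # query latent for the current token
    c_kv = h_t @ W_dkv  # shared key/value latent
    k_r = h_t @ W_kr    # positional key slice (rotation omitted in this sketch)
    return c_q, c_kv, k_r


h_t = rng.standard_normal(D_MODEL)
c_q, c_kv, k_r = down_project(h_t)

# What the cache would hold for this token: latent plus RoPE slice.
cache_entry = np.concatenate([c_kv, k_r])
print(c_q.shape, c_kv.shape, k_r.shape, cache_entry.shape)
# (512,) (512,) (64,) (576,)
```

The concatenated entry has 512 + 64 = 576 values, matching the per-token cache figure shown in the diagram.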

Top-k Selection (new in V3.2): select k=4 tokens
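A minimal sketch of the selection step, assuming the indexer has already produced one score per cached token. The function name `select_top_k` and the six example scores are hypothetical; only k=4 comes from the diagram, and how the real indexer computes its scores is not shown here.

```python
import numpy as np


def select_top_k(index_scores, k=4):
    """Return positions of the k highest-scoring cached tokens, in ascending order."""
    k = min(k, index_scores.shape[0])
    if k == 0:
        return np.array([], dtype=np.int64)
    # argpartition finds the k largest entries without a full sort.
    top = np.argpartition(index_scores, -k)[-k:]
    return np.sort(top)


# Hypothetical indexer scores for six cached tokens.
scores = np.array([0.1, 2.3, -0.5, 1.7, 0.9, 3.0])
selected = select_top_k(scores, k=4)
print(selected)  # [1 3 4 5]
```

Attention would then read cache entries only at the selected positions instead of at every cached token.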

Attention Computation (modified in V3.2): dense MLA over cached latents

Up-projection (MLA): 512 latent features -> model-width output

Concat Heads + Output Mix (MHA): join head outputs, then mix to model dimension

KV Cache: compressed latent + RoPE slice

Output: back in residual stream, \(\mathbb{R}^{d_{model}}\), model width 2048

MLA cache: 576 values/token. Dense pairs: 36. Sparse pairs: 24.

Values per token: MHA 16.4k; MLA 576; DSA read 384.
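The figures above can be related with a few lines of arithmetic. The 576 follows directly from the latent and RoPE widths in the diagram. The link between the DSA read of 384 and the 24/36 pair ratio is an inference from the numbers, not something the diagram states, and 16.4k is used as the rounded value shown.

```python
# Numbers read off the diagram.
LATENT = 512         # compressed KV latent per token
ROPE_KEY = 64        # RoPE key slice per token
MHA_VALUES = 16_400  # "16.4k", rounded as shown
DENSE_PAIRS = 36
SPARSE_PAIRS = 24

mla_values = LATENT + ROPE_KEY
print(mla_values)  # 576

# Approximate saving of the MLA cache relative to full K/V.
print(round(MHA_VALUES / mla_values, 1))  # 28.5

# Assumption: the DSA read scales with the fraction of pairs kept.
dsa_read = mla_values * SPARSE_PAIRS // DENSE_PAIRS
print(dsa_read)  # 384
```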

Legend: Full K/V; Latent \(c^{KV}\); RoPE key

MLA caches compressed state

\(q^C_t \cdot k^C_j\) + \(q^R_t \cdot k^R_j\) -> softmax weight
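Written as one expression, the content score and the RoPE score are summed, scaled, and normalized over the attended tokens. This is a sketch following the usual MLA formulation; the symbol \(a_{t,j}\) and the scaling dimensions \(d_h\) (content part of a head) and \(d_h^R\) (RoPE part of a head) do not appear in the diagram and are assumed here.

\[
a_{t,j} = \operatorname{softmax}_j\!\left( \frac{q^C_t \cdot k^C_j + q^R_t \cdot k^R_j}{\sqrt{d_h + d_h^R}} \right)
\]

With top-k selection, \(j\) would range only over the selected tokens rather than over every cached token.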