Interactive decode path

Move through one autoregressive step and compare how standard MHA, MLA, and sparse MLA spend cache memory.

Cache reduction: 31.1x

Projection Pipeline

Current token: kernel
Legend: MLA components; New in V3.2; Modified in V3.2

Input Tokens (full model space): \(h_t \in \mathbb{R}^{d_{model}}\), model width 2048
[MLA] Down-projection: 2048 wide -> 512 latent
[New] Indexer: scores tokens for sparse routing
[MHA] Full Q / K / V Per Head: standard tensors, no latent compression (Q 2048, K 2048, V 2048)
Compressed Q / KV Latent Space: narrow vectors cached or expanded on demand (\(c^Q\) 512, \(c^{KV}\) 512, \(k^R\) 64)
[New] Top-k Selection: select k=4 tokens
[Modified] Attention Computation: dense MLA over cached latents
[MLA] Up-projection: 512 latent features -> model-width output
[MHA] Concat Heads + Output Mix: join head outputs, then mix to model dimension
[Cache] KV Cache: compressed latent + RoPE slice
Output (back in residual stream): \(\mathbb{R}^{d_{model}}\), model width 2048

MLA cache: 576 values/token. Dense pairs: 36. Sparse pairs: 24.
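
The Indexer and Top-k Selection stages can be sketched in a few lines of NumPy. This is a minimal illustration, not the V3.2 indexer itself: the score values, the logits, and the helper names `select_top_k` and `sparse_attention_weights` are all made up for the example. Only k=4 comes from the pipeline above.

```python
import numpy as np

def select_top_k(index_scores: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k highest-scoring cached tokens, in position order."""
    k = min(k, index_scores.shape[0])
    # argpartition places the k largest scores in the last k slots (unordered).
    top = np.argpartition(index_scores, -k)[-k:]
    return np.sort(top)

def sparse_attention_weights(logits: np.ndarray, selected: np.ndarray) -> np.ndarray:
    """Softmax over the selected positions only; every other position gets weight 0."""
    weights = np.zeros_like(logits, dtype=float)
    chosen = logits[selected]
    exp = np.exp(chosen - chosen.max())  # subtract the max for numerical stability
    weights[selected] = exp / exp.sum()
    return weights

# Hypothetical values for 6 cached tokens (not taken from the page).
index_scores = np.array([0.1, 2.3, -0.5, 1.7, 0.9, 3.0])
logits = np.array([0.2, 1.0, 0.4, -0.3, 0.8, 1.5])

selected = select_top_k(index_scores, k=4)
weights = sparse_attention_weights(logits, selected)
print(selected)          # positions the attention step will read
print(weights.round(3))  # unselected positions carry zero weight
```

The point of the sketch is that the attention step only reads cache entries for the selected positions, so the number of query-key pairs grows with k rather than with the full context length.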

Cache Footprint

576 values/token
MHA: 16.4k
MLA: 576
DSA read: 384
Legend: Full K/V; Latent \(c^{KV}\); RoPE key
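
The 576 figure is the sum of the two cached pieces listed in the pipeline: the 512-wide latent \(c^{KV}\) and the 64-wide RoPE key. A short sketch of that arithmetic follows. The function names, the 2-byte value size, and the 4096-token context are assumptions for illustration, and the totals are per layer.

```python
def mla_cache_values_per_token(latent_dim: int, rope_dim: int) -> int:
    """Values MLA stores per token: the shared KV latent plus the RoPE key slice."""
    return latent_dim + rope_dim

def cache_bytes(values_per_token: int, n_tokens: int, bytes_per_value: int = 2) -> int:
    """Cache size for one layer; bytes_per_value=2 assumes a 16-bit format."""
    return values_per_token * n_tokens * bytes_per_value

mla_values = mla_cache_values_per_token(latent_dim=512, rope_dim=64)
print(mla_values)  # 576

# Hypothetical context length of 4096 tokens.
print(cache_bytes(mla_values, n_tokens=4096))  # 4718592 bytes, about 4.5 MiB per layer
```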

Attention Reconstruction

MLA caches compressed state
Terms shown: \(q^C_t \cdot k^C_j\); \(q^R_t \cdot k^R_j\); softmax weight
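
One way to read the terms above: the attention logit for cached token j is the content dot product plus the RoPE dot product, and a softmax over the cached tokens turns the logits into weights. The sketch below shows this for a single head under several assumptions. The head width of 128, the up-projection name `W_uk`, the random inputs, and the scaling factor are illustrative choices, not values from the page. Only the 512 and 64 widths come from the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

d_latent, d_rope = 512, 64  # cached widths from the pipeline
d_head = 128                # hypothetical per-head width
n_cached = 5                # hypothetical number of cached tokens

# What the cache holds per token: latent c^{KV} and RoPE key k^R.
c_kv = rng.standard_normal((n_cached, d_latent))
k_rope = rng.standard_normal((n_cached, d_rope))

# Hypothetical up-projection that expands the latent into content keys.
W_uk = rng.standard_normal((d_latent, d_head)) / np.sqrt(d_latent)

# Query for the current token t, split into content and RoPE parts.
q_content = rng.standard_normal(d_head)
q_rope = rng.standard_normal(d_rope)

def attention_weights(q_content, q_rope, c_kv, k_rope, W_uk):
    k_content = c_kv @ W_uk                 # expand cached latents on demand
    content_scores = k_content @ q_content  # q^C_t . k^C_j for every cached j
    rope_scores = k_rope @ q_rope           # q^R_t . k^R_j for every cached j
    # Assumed scaling: square root of the combined query width.
    logits = (content_scores + rope_scores) / np.sqrt(q_content.size + q_rope.size)
    exp = np.exp(logits - logits.max())     # stable softmax
    return exp / exp.sum()

weights = attention_weights(q_content, q_rope, c_kv, k_rope, W_uk)
print(weights.shape, weights.sum())
```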

Why the latent cache works