Projection Pipeline
MLA components
New in V3.2
Modified in V3.2
Input Tokens
full model space \(h_t \in \mathbb{R}^{d_{\text{model}}}\), \(d_{\text{model}} = 2048\)
MLA
Down-projection
2048 wide -> 512 latent
New
Indexer
scores tokens for sparse routing
MHA
Full Q / K / V Per Head
standard tensors, no latent compression
Q: 2048
K: 2048
V: 2048
Compressed
Q / KV Latent Space
narrow vectors cached or expanded on demand
\(c^{Q}\): 512
\(c^{KV}\): 512
\(k^{R}\): 64
New
Top-k Selection
select k=4 tokens
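The selection step can be sketched as follows. This is a minimal illustration, assuming the indexer has already produced one scalar score per cached token; the function name `top_k_indices` and the score values are hypothetical and do not come from the diagram.

```python
def top_k_indices(scores, k):
    """Return the positions of the k highest scores, in ascending position order."""
    # sorted() is stable, so ties keep their original (earlier-first) order.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical indexer scores of the current token against six cached tokens.
indexer_scores = [0.10, 0.70, 0.05, 0.90, 0.30, 0.60]

selected = top_k_indices(indexer_scores, k=4)
print(selected)  # [1, 3, 4, 5]
```

Only the selected positions would then be passed on to the attention computation; the remaining tokens are skipped for this query.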
Modified
Attention Computation
dense MLA over cached latents
MLA
Up-projection
512 latent features -> model-width output
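The down-projection and up-projection pair can be sketched as two matrix-vector products. This is an illustrative toy, assuming plain linear maps with random placeholder weights; the sizes are scaled down to 16 and 4 (keeping the diagram's 4:1 ratio of 2048 to 512), and the names `w_down` and `w_up` are hypothetical.

```python
import random

def matvec(weights, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def random_matrix(rows, cols, rng):
    """Gaussian matrix scaled by 1/sqrt(cols); a stand-in for learned weights."""
    return [[rng.gauss(0.0, 1.0) / cols ** 0.5 for _ in range(cols)]
            for _ in range(rows)]

# Toy sizes that keep the diagram's 4:1 ratio (2048 -> 512).
D_MODEL, D_LATENT = 16, 4

rng = random.Random(0)
w_down = random_matrix(D_LATENT, D_MODEL, rng)  # hypothetical down-projection
w_up = random_matrix(D_MODEL, D_LATENT, rng)    # hypothetical up-projection

h_t = [rng.gauss(0.0, 1.0) for _ in range(D_MODEL)]
c_t = matvec(w_down, h_t)  # narrow latent: the part that would be cached
out = matvec(w_up, c_t)    # expanded back to model width on demand

print(len(h_t), len(c_t), len(out))  # 16 4 16
```

The point of the sketch is the shape change: only the narrow vector `c_t` needs to be stored per token, and the wide vector is recomputed when attention needs it.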
MHA
Concat Heads + Output Mix
join head outputs, then mix to model dimension
Cache
KV Cache
compressed latent + RoPE slice
Output
back in residual stream \(\mathbb{R}^{d_{\text{model}}}\), \(d_{\text{model}} = 2048\)
MLA cache: 576 values/token
Dense pairs: 36
Sparse pairs: 24
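One reading that reproduces these two counts is a six-token example in which every token scores against all six tokens in the dense case and against only its k=4 selected tokens in the sparse case. The six-token sequence length and the absence of a causal mask are assumptions; the diagram states only the totals and k=4.

```python
def dense_pairs(n_tokens):
    """Every query token attends to every token (assumed: no causal mask)."""
    return n_tokens * n_tokens

def sparse_pairs(n_tokens, k):
    """Every query token attends only to its k selected tokens."""
    return n_tokens * min(k, n_tokens)

n_tokens = 6  # assumption: chosen because it reproduces the stated counts
k = 4         # from the diagram's Top-k Selection step

print(dense_pairs(n_tokens), sparse_pairs(n_tokens, k))  # 36 24
```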
Cache Footprint
576 values/token
MHA: 16.4k
MLA: 576
DSA read: 384
Full K/V
Latent \(c^{KV}\)
RoPE key
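The MLA figure follows from the component widths shown in the latent-space stage: a 512-wide KV latent plus a 64-wide RoPE key. The short calculation below also shows that the DSA read figure equals the MLA footprint scaled by the sparse-to-dense pair ratio (24/36); whether the diagram derives 384 that way is an assumption. The MHA value is the rounded 16.4k label, so the ratio is approximate.

```python
D_LATENT_KV = 512  # width of the cached KV latent
D_ROPE_KEY = 64    # width of the cached RoPE key slice

mla_values_per_token = D_LATENT_KV + D_ROPE_KEY
print(mla_values_per_token)  # 576

# Pair counts as stated in the diagram.
dense_pair_count, sparse_pair_count = 36, 24
dsa_read = mla_values_per_token * sparse_pair_count // dense_pair_count
print(dsa_read)  # 384

mha_values_per_token = 16_400  # rounded chart label ("16.4k"), not an exact count
print(round(mha_values_per_token / mla_values_per_token))  # 28
```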
Attention Reconstruction
MLA caches compressed state
\(q^C_t \cdot k^C_j\)
\(q^R_t \cdot k^R_j\)
softmax weight
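The two dot products above can be combined into a softmax weight as in the sketch below. It assumes the logit for token \(j\) is the sum of the content term \(q^C_t \cdot k^C_j\) and the RoPE term \(q^R_t \cdot k^R_j\), scaled by the square root of the combined width; the scaling choice, the function names, and the toy vectors are assumptions for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q_c, q_r, keys_c, keys_r):
    """Softmax over (content score + RoPE score) for each cached token."""
    scale = 1.0 / math.sqrt(len(q_c) + len(q_r))  # assumed scaling
    logits = [(dot(q_c, k_c) + dot(q_r, k_r)) * scale
              for k_c, k_r in zip(keys_c, keys_r)]
    return softmax(logits)

# Toy query and three cached tokens (values are illustrative only).
q_c, q_r = [1.0, 0.0], [0.5]
keys_c = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
keys_r = [[1.0], [0.0], [1.0]]

weights = attention_weights(q_c, q_r, keys_c, keys_r)
print(weights)
```

Splitting the score this way is equivalent to a single dot product over the concatenated vectors, which is why the cache only needs to hold the latent part and the RoPE part per token.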