Masked parallel prediction
The full target sequence can be fed in at once. The causal mask blocks attention to future positions, so each position is predicted from earlier tokens only.
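The masking described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not a reference implementation: the function names `causal_mask` and `masked_self_attention` are hypothetical, and learned projections, batching, and multiple heads are left out on purpose.

```python
import numpy as np

def causal_mask(seq_len):
    """Boolean matrix: entry [i, j] is True when position i may attend to position j."""
    # Lower triangle (including the diagonal) = current and earlier positions.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d). Returns (output, attention_weights).
    """
    seq_len, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # (seq_len, seq_len)
    # Future positions get -inf, so they receive zero weight after softmax.
    scores = np.where(causal_mask(seq_len), scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

All rows are computed in one matrix product, which is what makes the prediction parallel; the mask is the only thing that keeps row i from using positions after i.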
Prefill
Process all prompt tokens together and create the initial K/V cache.
Decode
Generate one token at a time. Each new Query attends over the K/V already in the cache, and the new token's own K/V is appended for later steps.
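The two inference phases can be sketched as follows. This is an illustrative single-head version under simplifying assumptions: `prefill`, `decode_step`, the dictionary-based cache, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical names, and the sketch appends the new token's K/V before attending so that the token also sees itself.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def prefill(x, w_q, w_k, w_v):
    """Process all prompt tokens together and build the initial K/V cache.

    x: (n, d_model) prompt embeddings. Returns (outputs, cache).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    allowed = np.tril(np.ones((n, n), dtype=bool))   # causal mask
    scores = np.where(allowed, scores, -np.inf)
    return softmax(scores) @ v, {"k": k, "v": v}

def decode_step(x_new, cache, w_q, w_k, w_v):
    """Process one new token of shape (1, d_model) using the cache."""
    q, k_new, v_new = x_new @ w_q, x_new @ w_k, x_new @ w_v
    # Append this token's K/V so later steps can reuse them.
    cache = {
        "k": np.concatenate([cache["k"], k_new]),
        "v": np.concatenate([cache["v"], v_new]),
    }
    # One Query row against every cached position; no mask is needed
    # because the cache holds no future positions.
    scores = q @ cache["k"].T / np.sqrt(q.shape[-1])
    return softmax(scores) @ cache["v"], cache
```

Under these assumptions, running `prefill` on the first n tokens and then `decode_step` on token n+1 gives the same output row as running `prefill` on all n+1 tokens, but without recomputing K/V for the earlier positions.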
Training-time masked attention
During training, masked self-attention lets every position be predicted in parallel while preventing any position from seeing the tokens that come after it.
What is stored?
The cache is empty in the training reference, because training recomputes K/V for the whole sequence in a single pass. It becomes important during inference.
Who can each position attend to?
Attention Math
In training and prefill, many positions can be processed together. In decode, the current token has one Query but attends across all cached K/V positions.
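For reference, the shapes involved can be written out with the standard scaled dot-product attention formula. The symbols n (number of positions processed together), t (number of cached positions at the current decode step), and d_k (key dimension) are notation assumed here, not taken from the text above.

```latex
% Training and prefill: n positions at once, causal mask M applied.
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}} + M\right) V,
\qquad
M_{ij} =
\begin{cases}
0       & j \le i \\
-\infty & j > i
\end{cases}
\]
\[
Q, K, V \in \mathbb{R}^{n \times d_k}
\;\Longrightarrow\;
QK^{\top} \in \mathbb{R}^{n \times n}
\]

% Decode: a single Query row against t cached positions, no mask required.
\[
\mathrm{Attention}(q_t, K_{1:t}, V_{1:t})
  = \mathrm{softmax}\!\left(\frac{q_t K_{1:t}^{\top}}{\sqrt{d_k}}\right) V_{1:t},
\qquad
q_t \in \mathbb{R}^{1 \times d_k},\;
K_{1:t}, V_{1:t} \in \mathbb{R}^{t \times d_k}
\]
```

The score matrix is n by n in training and prefill but only 1 by t in decode, which is why a decode step is cheap per token as long as K and V are read from the cache.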