Training reference

Masked parallel prediction

The full target sequence can be fed at once. The causal mask blocks each position from seeing future positions.

Inference phase 1

Prefill

Process all prompt tokens together and create the initial K/V cache.

Inference phase 2

Decode

Generate one token at a time. Each new Query reads prior cached K/V.

STEP 0: TRAINING REFERENCE

Training-time masked attention

During training, masked self-attention allows parallel predictions while preventing future-token cheating.

KV cache enabled
Rule: left only

K/V cache Memory

What is stored?

0 cached

Cache is empty in the training reference. It becomes important during inference.

Causal mask

Who can each position attend to?

1: Allowed
×: Future masked
-: Not active

Shape tracker

Attention Math

Attn(Q, K, V) = softmax(QKᵀ / √d) V
Q: T × d
K, V available: T × d
Attention scores: T × T

In training and prefill, many positions can be processed together. In decode, the current token has one Query but attends across all cached K/V positions.
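
The shapes above can be checked with a small NumPy sketch. It is a single-head toy under stated assumptions: random Q, K, V stand in for real projections, and `causal_attention` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; masked entries (-inf) become 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(Q, K, V):
    """Q: (Tq, d); K, V: (Tk, d). Query i is assumed to sit at position Tk - Tq + i."""
    Tq, d = Q.shape
    Tk = K.shape[0]
    scores = Q @ K.T / np.sqrt(d)                 # (Tq, Tk)
    q_pos = np.arange(Tk - Tq, Tk)[:, None]       # absolute position of each query
    k_pos = np.arange(Tk)[None, :]                # absolute position of each key
    scores = np.where(k_pos <= q_pos, scores, -np.inf)   # "left only" rule
    return softmax(scores) @ V, scores.shape

rng = np.random.default_rng(0)
T, d = 5, 8
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Training / prefill: all T queries at once, scores are T x T.
full_out, full_shape = causal_attention(Q, K, V)

# Decode: one query for the newest position, scores are 1 x T.
last_out, last_shape = causal_attention(Q[-1:], K, V)

print(full_shape, last_shape)
print(np.allclose(full_out[-1], last_out[0]))
```

The last row of the parallel result matches the single-query result, which is why decode can reuse stored K/V instead of redoing the whole T × T computation.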

Simplified cost model

What changes?

Parallel work: 0 tokens
Sequential step: 0
K/V reads for Attn: 0 tokens
Recomputes avoided: 0
💡 Insight: Start at the training reference. Then step into prefill and decode to see how the computational bottleneck shifts.
📚 How this connects to LLM Architecture

Understanding the shift from Prefill to Decode is critical for deploying modern LLMs. Strategies like batching, speculative decoding, and low-rank KV compression exist largely to address the bottlenecks shown in the visualization above.

1. The O(N²) vs O(N) Problem

Without a KV Cache, producing the Nth token means passing all N tokens through the model again, so attention alone costs O(N²) at every step.

By saving the K and V matrices (as seen in the Cache Memory grid), we compute Q, K, and V only for the *newest* token, append its K and V to the cache, and attend over the N cached positions, making each step O(N).
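
A minimal sketch of the two decode strategies, assuming a single attention layer with random weights (`Wq`, `Wk`, `Wv`) and random vectors standing in for token embeddings. It counts only K/V projections, so it understates the cost of a real uncached forward pass, which also repeats every other layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """One query of shape (d,) against K, V of shape (n, d)."""
    return softmax(K @ q / np.sqrt(q.shape[0])) @ V

rng = np.random.default_rng(0)
d, N = 8, 6
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
X = rng.standard_normal((N, d))        # stand-in token embeddings

# Without a cache: re-project every token seen so far at each step.
no_cache, reprojected = [], 0
for t in range(1, N + 1):
    K, V = X[:t] @ Wk, X[:t] @ Wv      # t projections, repeated every step
    reprojected += t
    no_cache.append(attend(X[t - 1] @ Wq, K, V))

# With a cache: project only the newest token and append its K/V rows.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
with_cache, projected = [], 0
for t in range(N):
    x = X[t]
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    projected += 1
    with_cache.append(attend(x @ Wq, K_cache, V_cache))

print(np.allclose(no_cache, with_cache))   # same outputs either way
print(reprojected, projected)              # 1 + 2 + ... + N versus N
```

Both loops produce the same attention outputs. The cached version does N projections in total instead of N(N+1)/2, at the price of keeping `K_cache` and `V_cache` in memory.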

2. The "Memory Wall" in Decode

Look at the Cost Model during Decode: Parallel work drops to 1, but "K/V reads" continues to grow.

During Decode, the GPU's math units sit mostly idle because every generated token has to wait for the model weights and the ever-growing KV cache to be read from memory. Prefill is compute-bound, but Decode is notoriously memory-bandwidth-bound.
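
One rough way to see this is to count FLOPs per byte of K/V read (arithmetic intensity) for a single attention head. The sketch below is a back-of-the-envelope model with assumed sizes (2,048 positions, head dimension 128, 2 bytes per element). It ignores the causal mask, softmax, weight reads, and Q/output traffic, and `attention_intensity` is a hypothetical helper.

```python
def attention_intensity(n_queries, n_keys, d, bytes_per_elem=2):
    """Approximate FLOPs per byte of K/V read for one attention head."""
    # 2*Tq*Tk*d FLOPs for Q @ K^T, plus 2*Tq*Tk*d for weights @ V.
    flops = 4 * n_queries * n_keys * d
    # K and V are each read once: 2 tensors of n_keys * d elements.
    kv_bytes = 2 * n_keys * d * bytes_per_elem
    return flops / kv_bytes

# Prefill: every prompt position is a query, so each K/V byte is reused many times.
prefill = attention_intensity(n_queries=2048, n_keys=2048, d=128)

# Decode: a single query, so each K/V byte is used once and discarded.
decode = attention_intensity(n_queries=1, n_keys=2048, d=128)

print(prefill, decode)   # 2048.0 1.0
```

Under these assumptions the intensity equals the number of queries processed together. At about 1 FLOP per byte, decode attention sits far below the ratio an accelerator typically needs to keep its math units busy, which is why batching several sequences together helps.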

3. Why Context Windows Cost RAM

Every token in a KV slot stores a Key and a Value vector for every layer and every K/V head. In a 70B-parameter model, a single token can take on the order of 1MB of memory at 16-bit precision, depending on how many K/V heads the model keeps.

At ~1MB per token, a 128,000-token context window requires over 100GB of VRAM just for the cache! This is why architectures like Multi-Query Attention (MQA) and Multi-head Latent Attention (MLA) were invented.
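
The arithmetic can be sketched as follows. The layer count, head counts, and head dimension are assumed values for a hypothetical 70B-class model, not figures from a specific checkpoint, and the cache is assumed to hold 16-bit values.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # Factor of 2: one Key vector and one Value vector per layer per K/V head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Assumed shape of a hypothetical 70B-class model.
n_layers, head_dim = 80, 128
context = 128_000

# Number of K/V heads kept per layer under each scheme (assumed values).
schemes = {
    "MHA (64 K/V heads)": 64,      # every attention head keeps its own K/V
    "8 shared K/V heads": 8,       # groups of query heads share K/V
    "MQA (1 K/V head)": 1,         # all query heads share a single K/V
}

sizes_gb = {}
for label, n_kv_heads in schemes.items():
    per_token = kv_bytes_per_token(n_layers, n_kv_heads, head_dim)
    sizes_gb[label] = per_token * context / 1e9
    print(f"{label}: {per_token / 1e6:.2f} MB/token, {sizes_gb[label]:.1f} GB total")
```

With these assumed numbers the per-token cost ranges from about 2.6 MB with a full set of K/V heads down to about 0.04 MB with one, which brackets the ~1MB figure and shows how much sharing K/V heads saves.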