Transformer Inference: Prefill vs Decode Visualizer

Training reference

Masked parallel prediction

The full target sequence can be fed at once. The causal mask blocks attention to future positions.

Inference phase 1

Prefill

Process all prompt tokens together and create the initial K/V cache.
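A minimal sketch of what prefill produces, assuming toy projection matrices `W_k` and `W_v` and plain Python lists in place of tensors (these names and values are illustrative, not taken from the visualizer). Every prompt token is projected to a key and a value, and the results become the initial cache:

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def prefill(prompt_embeddings, W_k, W_v):
    """Project every prompt token to a key and a value; return them as the initial cache."""
    cache = {"K": [], "V": []}
    # Each token's projection is independent of the others, so real
    # implementations batch this loop into one matrix multiply.
    for x in prompt_embeddings:
        cache["K"].append(matvec(W_k, x))
        cache["V"].append(matvec(W_v, x))
    return cache

# Hypothetical toy values: 3 prompt tokens, width d = 2.
W_k = [[1.0, 0.0], [0.0, 1.0]]
W_v = [[0.5, 0.0], [0.0, 0.5]]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = prefill(prompt, W_k, W_v)
print(len(cache["K"]), "positions cached")
```

The cache grows by one entry per prompt token, which is why the visualizer's cache counter jumps from 0 to the prompt length in a single phase.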

Inference phase 2

Decode

Generate one token at a time. Each new Query reads prior cached K/V.
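One decode step can be sketched as follows. This is an illustration under assumptions: the cache is a dict of Python lists, the helper names are hypothetical, and the new token's own K/V is appended before attending so that it can see itself as well as the prior positions (the usual causal rule).

```python
import math

def attend(q, keys, values):
    """Single-query scaled dot-product attention over all cached positions."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]

def decode_step(cache, q, k_new, v_new):
    """Append the new token's K/V to the cache, then attend with its one query."""
    cache["K"].append(k_new)
    cache["V"].append(v_new)
    return attend(q, cache["K"], cache["V"])

# Hypothetical cache holding two prompt positions.
cache = {"K": [[1.0, 0.0], [0.0, 1.0]], "V": [[1.0, 0.0], [0.0, 1.0]]}
out = decode_step(cache, q=[0.0, 0.0], k_new=[1.0, 1.0], v_new=[1.0, 1.0])
print(len(cache["K"]), out)
```

With an all-zero query every score is equal, so the output is simply the average of the three cached values. Only one key and one value are computed per step; everything else is read from the cache.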

STEP 0: TRAINING REFERENCE

Training-time masked attention

During training, masked self-attention allows parallel predictions while preventing future-token cheating.

KV cache: enabled. Rule: left only (each position attends only to itself and earlier positions).

K/V cache memory

What is stored?

0 cached

Cache is empty in the training reference. It becomes important during inference.

Causal mask

Who can each position attend to?

Legend: Allowed, Future masked, Not active
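The "left only" rule can be written down directly. The sketch below builds the mask grid for a sequence of length T and renders it as text; the function names and the `A` / `.` symbols are illustrative choices, not the visualizer's.

```python
def causal_mask(T):
    """Return a T x T grid: True where query position i may attend to key position j."""
    return [[j <= i for j in range(T)] for i in range(T)]

def render(mask):
    """Draw allowed cells as 'A' and masked future cells as '.'."""
    return "\n".join(" ".join("A" if ok else "." for ok in row) for row in mask)

print(render(causal_mask(4)))
```

Each row is one query position. The lower triangle, including the diagonal, is allowed; everything to the right of the diagonal is a future position and is masked.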

Shape tracker

Attention Math

Attn(Q, K, V) = softmax( QK^T / √d ) V

Q: T × d

K, V available: T × d

Attention scores: T × T

In training and prefill, all T positions are processed together, so the score matrix is T × T. In decode, the current token contributes a single Query (1 × d) that attends across all cached K/V positions, so the score matrix is 1 × T.
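To make the shape difference concrete, the sketch below computes causal attention weights for a block of queries against a set of keys. It is a pure-Python illustration with hypothetical names; the alignment rule (the last query corresponds to the last key) is an assumption that covers both the T × T case and the 1 × T case.

```python
import math

def masked_attention_weights(Q, K):
    """Row-wise softmax of QK^T / sqrt(d) under a causal mask.

    Q has shape (Tq, d) and K has shape (Tk, d). Query i sits at absolute
    position Tk - Tq + i, so Tq == Tk models training/prefill and Tq == 1
    models a decode step.
    """
    Tq, Tk, d = len(Q), len(K), len(Q[0])
    offset = Tk - Tq
    rows = []
    for i, q in enumerate(Q):
        visible = offset + i + 1  # keys 0 .. offset + i are allowed
        scores = [sum(a * b for a, b in zip(q, K[j])) / math.sqrt(d)
                  for j in range(visible)]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        # Masked (future) positions receive zero weight.
        rows.append([e / total for e in exps] + [0.0] * (Tk - visible))
    return rows

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]       # toy inputs reused as Q and K
full = masked_attention_weights(X, X)          # prefill-style: 3 x 3
last = masked_attention_weights(X[-1:], X)     # decode-style: 1 x 3
print(len(full), "x", len(full[0]), "vs", len(last), "x", len(last[0]))
print(last[0] == full[-1])
```

The decode-style call reproduces the final row of the full matrix, which is the sense in which decode does the same attention math for one query at a time.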

Simplified cost model

What changes?

Parallel work: 0 tokens

Sequential step: 0

K/V reads for Attn: 0 tokens

Recomputes avoided: 0
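One way such counters might be defined for a decode step is sketched below. The formulas are assumptions made for illustration and may not match how the visualizer computes its numbers: the new token reads every cached position plus its own, and the cache saves one K/V recomputation per previously seen token.

```python
def decode_costs(prompt_len, step):
    """Toy counters for decode step `step` (1-indexed) after a prompt of `prompt_len` tokens.

    These definitions are illustrative assumptions, not the visualizer's formulas.
    """
    cached_before = prompt_len + step - 1  # prompt tokens plus earlier generated tokens
    return {
        "parallel_work_tokens": 1,            # decode handles one new token per step
        "sequential_step": step,
        "kv_reads": cached_before + 1,        # all cached positions plus the new one
        "recomputes_avoided": cached_before,  # K/V projections not redone thanks to the cache
    }

for step in (1, 2, 3):
    print(step, decode_costs(prompt_len=4, step=step))
```

Under these assumptions the parallel work stays fixed at one token while the K/V reads grow by one each step, which is the shift in cost that the counters are meant to show.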

💡 Insight: Start at the training reference. Then step into prefill and decode to see how the computational bottleneck shifts.