LLM Decoding Strategies Explorer

Compare greedy, temperature, top-k, top-p, beam search, contrastive, and speculative decoding with live probability distributions and full autoregressive token generation.

GPT-2-style decoder continuation. Context so far: "The cat sat" → predicting next tokens (the canonical example used in Part 2 lecture slides 6 / 11–22).

🎯 Greedy Decoding

At each position, pick the single token with the highest probability. After picking, advance to the next position and re-render its distribution. Deterministic and fast.

Current position:
Generated sequence
x_t = argmax_{v ∈ V} P(v | x_{1..t-1})
💡 Deterministic and cheap: one argmax over the vocabulary per step, with no sampling. Every run produces the same sentence, so alternatives are never explored.
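The greedy rule can be sketched in a few lines of plain Python. This is a minimal illustration, not the demo's actual code: the candidate tokens and logits in `positions` are made-up values, and a real model would recompute the logits from the full prefix at every step.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Return the index of the highest-probability token."""
    probs = softmax(logits)
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical per-position candidates and logits (illustration only).
positions = [
    (["on", "in", "down"], [2.5, 0.7, 0.1]),
    (["the", "a", "my"], [2.0, 1.1, 0.3]),
    (["mat", "sofa", "floor"], [1.4, 1.2, 0.9]),
]

generated = ["The", "cat", "sat"]
for vocab, logits in positions:
    # Pick the argmax token, append it, then move to the next position.
    generated.append(vocab[greedy_pick(logits)])

sentence = " ".join(generated)
print(sentence)
```

Because nothing is random, running the loop again yields the same sentence.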

🌡 Temperature Sampling

Scale logits by 1/T before softmax. Each Generate Next Token click samples from the current position and advances to the next one — the bar chart re-renders for the new position.

Current position:
Generated sequence
P_T(v) = softmax(logit_v / T)
💡 Runs can produce different sentences. Low T → peaked and repetitive; high T → more varied but less coherent.
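To make the effect of T concrete, here is a small sketch of temperature scaling followed by sampling. The logits are hypothetical, and the function names are chosen for this illustration only.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_with_temperature(logits, temperature, rng):
    """Sample one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Hypothetical logits for four candidate tokens.
logits = [3.0, 1.2, 0.5, -0.4]

sharp = softmax_with_temperature(logits, 0.5)  # low T: mass concentrates on the top token
flat = softmax_with_temperature(logits, 2.0)   # high T: mass spreads out

rng = random.Random(0)  # seeded so the run is reproducible
index = sample_with_temperature(logits, 1.0, rng)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
print(index)
```

Comparing `sharp` and `flat` shows the top token's probability shrinking as T grows, which is the same change the bar chart displays.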

Top-k Sampling

Keep only the k highest-probability tokens, renormalize, then sample. After sampling, advance to the next position.

Current position:
Generated sequence
keep top-k tokens; mask the rest; renormalize; sample
💡 Simple but not adaptive — k stays constant even when only 1 token really matters.
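The keep, mask, renormalize, sample pipeline above might look like this in plain Python. The probabilities are invented for the example, and `top_k_filter` / `sample_top_k` are illustrative names rather than a library API.

```python
import random

def top_k_filter(probs, k):
    """Zero out everything except the k most probable tokens, then renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in keep)
    filtered = [0.0] * len(probs)
    for i in keep:
        filtered[i] = probs[i] / kept_mass
    return filtered

def sample_top_k(probs, k, rng):
    """Sample one token index from the top-k filtered distribution."""
    filtered = top_k_filter(probs, k)
    return rng.choices(range(len(filtered)), weights=filtered, k=1)[0]

# Hypothetical next-token probabilities.
probs = [0.5, 0.3, 0.15, 0.05]

filtered = top_k_filter(probs, 2)
rng = random.Random(1)  # seeded for reproducibility
index = sample_top_k(probs, 2, rng)
print(filtered)
print(index)
```

Note that k = 2 keeps two tokens here whether the distribution is sharp or flat, which is the lack of adaptivity the tip describes.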

Top-p (Nucleus) Sampling

Keep the smallest set of highest-probability tokens whose cumulative probability is at least p, renormalize, then sample. The nucleus size adapts to the shape of the distribution.

[Cumulative probability bar from 0% to 100%, with the threshold marked at p and the nucleus size shown in tokens]
Current position:
Generated sequence
nucleus = shortest prefix of the tokens, sorted by descending P, with Σ P(v) ≥ p
💡 Watch the nucleus size change across positions — sharp distributions shrink it, flat ones widen it.
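A short sketch shows how the nucleus grows and shrinks with the distribution. The two example distributions are made up, and the function names are illustrative only.

```python
import random

def nucleus(probs, p):
    """Return indices of the smallest descending-probability prefix with mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen, cumulative = [], 0.0
    for i in order:
        chosen.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return chosen

def sample_top_p(probs, p, rng):
    """Renormalize over the nucleus and sample one token index from it."""
    chosen = nucleus(probs, p)
    mass = sum(probs[i] for i in chosen)
    weights = [probs[i] / mass for i in chosen]
    return rng.choices(chosen, weights=weights, k=1)[0]

# Hypothetical distributions: one sharp, one flat.
sharp = [0.9, 0.05, 0.03, 0.02]
flat = [0.25, 0.25, 0.25, 0.25]

sharp_nucleus = nucleus(sharp, 0.85)
flat_nucleus = nucleus(flat, 0.85)
rng = random.Random(0)  # seeded for reproducibility
index = sample_top_p(flat, 0.85, rng)
print(len(sharp_nucleus), len(flat_nucleus))
```

With the same threshold p = 0.85, the sharp distribution needs one token while the flat one needs all four.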

🌳 Beam Search

Keep the B highest-scoring partial sequences at each step. Score is the sum of log-probabilities along the path.

Press ▶ Next Step to begin
score(x_{1..t}) = Σ_{i=1..t} log P(x_i | x_{1..i-1})
💡 Each step expands the B active beams → B×|V| candidates → keep top-B. Short and generic outputs tend to win without length normalization.
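The expand-then-prune loop can be sketched as follows. `TOY_MODEL` is a hypothetical lookup table that conditions only on the previous token; it stands in for a real language model purely for illustration.

```python
import math

# Hypothetical toy model: next-token probabilities keyed by the last token.
TOY_MODEL = {
    "sat": {"on": 0.6, "in": 0.3, "down": 0.1},
    "on": {"the": 0.7, "a": 0.3},
    "in": {"the": 0.5, "a": 0.5},
    "down": {"the": 0.4, "a": 0.6},
    "the": {"mat": 0.6, "sofa": 0.4},
    "a": {"mat": 0.5, "sofa": 0.5},
}

def beam_search(start, steps, beam_width):
    """Keep the beam_width best partial sequences, scored by summed log-probability."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for tokens, score in beams:
            # Expand every active beam by every possible next token.
            for token, prob in TOY_MODEL[tokens[-1]].items():
                candidates.append((tokens + [token], score + math.log(prob)))
        # Prune back down to the best beam_width candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams

beams = beam_search("sat", steps=3, beam_width=2)
for tokens, score in beams:
    print(" ".join(tokens), round(score, 3))
```

In this toy run both surviving beams begin with "sat on the" and differ only in the last token, a small-scale instance of beams sharing a prefix.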
Beam collapse, visible in the demo above: the top-B paths all share the same prefix (on the soft …) and only diverge late. Why? At step 1 the top token (on) leads the runner-up (in) by ~1.8 log-units, and every subsequent position uses the same distribution regardless of prefix (the demo is unconditioned). In real LLMs, P(the | "...sat on") ≠ P(the | "...sat in"), which lets beams genuinely diverge, but beam search still tends to collapse in practice. That is why modern LLMs usually prefer sampling (temperature / top-p) over beam search. When beam search is needed, apply length normalization (score / length^α) or diverse beam search (penalize beams that share prefixes).
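A quick numeric sketch of length normalization, using the score / length^α form mentioned above. The two candidate scores are invented for the example.

```python
def length_normalized(score, length, alpha):
    """Divide a summed log-probability by length**alpha."""
    return score / (length ** alpha)

# Hypothetical candidates: (summed log-probability, number of tokens).
short_score, short_len = -2.0, 2
long_score, long_len = -3.0, 5

# Without normalization (alpha = 0) the short sequence has the higher score.
raw_short = length_normalized(short_score, short_len, 0.0)
raw_long = length_normalized(long_score, long_len, 0.0)

# With alpha = 1 the score becomes an average log-probability per token.
norm_short = length_normalized(short_score, short_len, 1.0)
norm_long = length_normalized(long_score, long_len, 1.0)

print(raw_short, raw_long)
print(norm_short, norm_long)
```

Every added token contributes a negative log-probability, so raw sums favor short outputs; dividing by length^α offsets that bias, and here it flips the ranking in favor of the longer candidate.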

⚖ Contrastive Decoding

Pick tokens that the expert model prefers more strongly than a small amateur model does. This penalizes tokens the amateur likes (generic, high-frequency ones). Each Generate Next Token click advances one position.

Current position:
Expert P_exp
Amateur P_ama
CD score
Generated sequence
score(v) = log P_exp(v) − α · log P_ama(v)
💡 α = 0 is identical to expert greedy. As α grows, tokens the amateur finds easy are pushed down — the output sentence may diverge from greedy.
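The scoring rule can be tried on a toy example. The vocabulary and both probability vectors below are hypothetical, chosen so that the amateur strongly favors a generic token.

```python
import math

def contrastive_scores(p_expert, p_amateur, alpha):
    """score(v) = log P_exp(v) - alpha * log P_ama(v) for every token v."""
    return [math.log(pe) - alpha * math.log(pa)
            for pe, pa in zip(p_expert, p_amateur)]

def pick(vocab, scores):
    """Return the token with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return vocab[best]

# Hypothetical distributions over a four-token vocabulary.
vocab = ["the", "a", "mat", "windowsill"]
p_expert = [0.40, 0.30, 0.25, 0.05]
p_amateur = [0.60, 0.25, 0.10, 0.05]

greedy_choice = pick(vocab, contrastive_scores(p_expert, p_amateur, 0.0))
contrastive_choice = pick(vocab, contrastive_scores(p_expert, p_amateur, 1.0))
print(greedy_choice, contrastive_choice)
```

At α = 0 the score reduces to log P_exp, so the pick matches expert greedy. At α = 1 the score is the log-ratio of expert to amateur probability, and the token the amateur rates highest loses its lead.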

⚡ Speculative Decoding

A fast draft model proposes γ tokens up front, generating all γ before it knows which will be accepted; the slow target model then verifies all γ in one batched forward pass. Each drafted token is accepted with probability min(1, P_target / P_draft); on the first rejection, a correction token is sampled and the remaining drafted tokens are discarded. The output distribution equals plain target sampling.

Draft
Target verify
Result
Steps: 0
Tokens generated: 0
Avg tokens / step:
Accepted sequence
accept x̃_t if U[0,1] < min(1, P_target(x̃_t) / P_draft(x̃_t))
on reject: sample from normalize(max(0, P_target − P_draft))
💡 Avg tokens/step > 1 means speedup: the target runs once but emits multiple accepted tokens.
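The accept/reject rule for a single drafted token can be sketched as below. The two distributions are hypothetical, and the sketch covers only one position; a full implementation would loop over all γ drafted tokens and stop at the first rejection.

```python
import random

def verify_draft_token(token, p_target, p_draft, rng):
    """Accept the drafted token or replace it. Returns (token, accepted)."""
    accept_prob = min(1.0, p_target[token] / p_draft[token])
    if rng.random() < accept_prob:
        return token, True
    # Rejected: resample from the normalized positive part of (target - draft).
    residual = [max(0.0, t - d) for t, d in zip(p_target, p_draft)]
    total = sum(residual)
    weights = [r / total for r in residual]
    correction = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return correction, False

def speculative_sample_once(p_target, p_draft, rng):
    """Draft one token from p_draft, then verify it against p_target."""
    drafted = rng.choices(range(len(p_draft)), weights=p_draft, k=1)[0]
    return verify_draft_token(drafted, p_target, p_draft, rng)

# Hypothetical target and draft distributions over three tokens.
p_target = [0.5, 0.3, 0.2]
p_draft = [0.2, 0.6, 0.2]

rng = random.Random(0)  # seeded for reproducibility
counts = [0, 0, 0]
trials = 20000
for _ in range(trials):
    token, _accepted = speculative_sample_once(p_target, p_draft, rng)
    counts[token] += 1

frequencies = [c / trials for c in counts]
print([round(f, 3) for f in frequencies])
```

The empirical frequencies should land close to `p_target` even though every proposal came from `p_draft`, which is the sense in which the output distribution equals plain target sampling.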