LLM Decoding Strategies Explorer

Compare greedy, temperature, top-k, top-p, beam search, contrastive, and speculative decoding with live probability distributions and full autoregressive token generation.

GPT-2-style decoder continuation. Context so far: "The cat sat" → predicting next tokens (the canonical example used in Part 2 lecture slides 6 / 11–22).

🎯 Greedy Decoding

At each position, pick the single token with the highest probability. After picking, advance to the next position and re-render its distribution. Deterministic and fast.

Current position: —

Generated sequence

x_t = argmax_{v ∈ V} P(v | x_1..t-1)

💡 Deterministic and cheap: one argmax over the vocabulary per step, with no sampling. Every run produces the same sentence, so alternatives are never explored.
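As a minimal sketch of the greedy rule, the snippet below applies an argmax at each position. The per-position distributions and the name greedy_decode are hypothetical, invented for illustration; they are not the demo's actual data or code.

```python
# Hypothetical per-position distributions following "The cat sat".
# The numbers are invented for illustration only.
POSITIONS = [
    {"on": 0.62, "by": 0.16, "down": 0.12, "in": 0.10},
    {"the": 0.70, "a": 0.20, "my": 0.10},
    {"mat": 0.45, "sofa": 0.35, "floor": 0.20},
]

def greedy_decode(positions):
    """Pick the highest-probability token at every position."""
    sequence = []
    for dist in positions:
        # argmax over the vocabulary of this position
        sequence.append(max(dist, key=dist.get))
    return sequence

sequence = greedy_decode(POSITIONS)
print("The cat sat " + " ".join(sequence))
```

Running it twice gives the same sequence, which is the determinism the note above describes.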

🌡 Temperature Sampling

Scale logits by 1/T before softmax. Each Generate Next Token click samples from the current position and advances to the next one — the bar chart re-renders for the new position.

T = 1.0

Current position: —

Generated sequence

P_T(v) = softmax(logit_v / T)

💡 Each run can produce a different sentence. Low T → peaked distribution, repetitive output (approaching greedy as T → 0); high T → flatter distribution, more creative but less coherent output.
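The effect of T can be sketched in a few lines. The logits below are hypothetical, and softmax_with_temperature and sample_token are illustrative helper names rather than part of the demo.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Sample one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical logits for the position after "The cat sat".
tokens = ["on", "down", "in", "by"]
logits = [2.0, 0.5, 0.2, 0.0]

cold = softmax_with_temperature(logits, 0.5)  # sharper than T = 1
hot = softmax_with_temperature(logits, 2.0)   # flatter than T = 1
rng = random.Random(0)
picked = sample_token(tokens, logits, 1.0, rng)
```

Comparing cold[0] with hot[0] shows the top token gaining probability mass as T drops and losing it as T rises.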

Top-k Sampling

Keep only the k highest-probability tokens, renormalize, then sample. After sampling, advance to the next position.

k = 5

Current position: —

Generated sequence

keep top-k tokens; mask the rest; renormalize; sample

💡 Simple but not adaptive — k stays constant even when only 1 token really matters.
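A minimal sketch of the keep, mask, renormalize, sample pipeline follows. The distribution is hypothetical and top_k_filter is an illustrative name, not the demo's implementation.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept = ranked[:k]  # everything after the first k is masked out
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

# Hypothetical next-token distribution (invented numbers).
probs = {"on": 0.55, "in": 0.15, "down": 0.12,
         "by": 0.10, "under": 0.05, "quietly": 0.03}

filtered = top_k_filter(probs, k=3)
rng = random.Random(0)
token = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```

Note that k = 3 keeps three candidates here even though "on" already holds more than half of the mass, which is the lack of adaptivity mentioned above.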

Top-p (Nucleus) Sampling

Keep the smallest set of tokens whose cumulative probability is ≥ p, renormalize, then sample from that set. The nucleus size adapts to the shape of the distribution.

p = 0.90

[Cumulative probability bar: scale from 0% to 100% with the threshold marked at p; the nucleus size in tokens is shown alongside.]

Current position: —

Generated sequence

nucleus = smallest S ⊆ V, filled in descending P, with Σ_{v ∈ S} P(v) ≥ p; renormalize; sample

💡 Watch the nucleus size change across positions — sharp distributions shrink it, flat ones widen it.
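The adaptive nucleus can be sketched as follows. Both distributions are hypothetical and the function name nucleus is illustrative only.

```python
def nucleus(probs, p):
    """Return the renormalized smallest set of top tokens with cumulative mass >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # stop as soon as the threshold is reached
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Two hypothetical distributions over the same tokens (invented numbers).
sharp = {"on": 0.92, "in": 0.04, "by": 0.02, "down": 0.02}
flat = {"on": 0.30, "in": 0.25, "by": 0.25, "down": 0.20}

sharp_nucleus = nucleus(sharp, 0.90)
flat_nucleus = nucleus(flat, 0.90)
print(len(sharp_nucleus), len(flat_nucleus))
```

With the same p, the sharp distribution yields a one-token nucleus while the flat one keeps all four tokens.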

🌳 Beam Search

Keep the B highest-scoring partial sequences at each step. Score is the sum of log-probabilities along the path.

Beam width B = 3

Press ▶ Next Step to begin

score(x_1..t) = Σ log P(x_i | x_1..i-1)

💡 Each step expands the B active beams → B×|V| candidates → keep top-B. Short and generic outputs tend to win without length normalization.

⚠ Beam collapse is visible above: the top-B paths all share the same prefix (on the soft …) and only diverge late. Why? At step 1 the top token (on) leads the runner-up (in) by ~1.8 log-units, and every subsequent position uses the same distribution regardless of prefix (the demo is unconditioned), so a path starting with "in" always trails the matching path starting with "on" by that same margin. In real LLMs, P(the | "...sat on") ≠ P(the | "...sat in"), which lets beams genuinely diverge, but beam search still tends to collapse in practice. That is one reason modern LLMs usually prefer sampling (temperature / top-p) over beam search. When beam search is needed, apply length normalization (score / length^α) or diverse beam search (penalize beams that share prefixes).
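A minimal sketch of the expand-and-prune loop and of the length-normalized score is shown below. The distributions are hypothetical and, like the demo, do not depend on the prefix; beam_search and length_normalized are illustrative names.

```python
import math

# Hypothetical per-position distributions (invented numbers). Every prefix
# sees the same distribution at a given position, as in the unconditioned demo.
POSITIONS = [
    {"on": 0.60, "down": 0.30, "in": 0.10},
    {"the": 0.70, "a": 0.30},
    {"mat": 0.50, "sofa": 0.30, "floor": 0.20},
]

def beam_search(positions, beam_width):
    """Keep the beam_width best partial sequences by summed log-probability."""
    beams = [((), 0.0)]  # (tokens so far, sum of log-probs)
    for dist in positions:
        # Expand every active beam with every token: B x |V| candidates.
        candidates = [
            (tokens + (tok,), score + math.log(p))
            for tokens, score in beams
            for tok, p in dist.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune back to the top B
    return beams

def length_normalized(score, length, alpha):
    """Length-normalized score: score / length^alpha."""
    return score / (length ** alpha)

beams = beam_search(POSITIONS, beam_width=2)
for tokens, score in beams:
    print(" ".join(tokens), round(score, 3))

# A short hypothesis beats a longer one on raw score, but not once normalized.
short_score = 2 * math.log(0.5)  # 2 tokens, each with probability 0.5
long_score = 4 * math.log(0.6)   # 4 tokens, each with probability 0.6
```

Both surviving beams start with "on the", a small instance of the shared-prefix collapse described above. With α = 1 the four-token hypothesis overtakes the two-token one, which is the bias that length normalization is meant to correct.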

⚖ Contrastive Decoding

Pick tokens that the expert model favors more than a small amateur model does. This penalizes tokens the amateur likes (generic, high-frequency ones). Each Generate Next Token click advances one position.

α = 0.10

Current position: —

Expert P_exp

Amateur P_ama

CD score

Generated sequence

score(v) = log P_exp(v) − α · log P_ama(v)

💡 α = 0 is identical to expert greedy. As α grows, tokens the amateur finds easy are pushed down — the output sentence may diverge from greedy.
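The score above can be sketched directly. The two distributions are hypothetical, and contrastive_scores and contrastive_pick are illustrative names; the formula is the one shown on this page, with α as the weight on the amateur term.

```python
import math

def contrastive_scores(p_expert, p_amateur, alpha):
    """score(v) = log P_exp(v) - alpha * log P_ama(v) for every token v."""
    return {
        tok: math.log(p_expert[tok]) - alpha * math.log(p_amateur[tok])
        for tok in p_expert
    }

def contrastive_pick(p_expert, p_amateur, alpha):
    """Return the token with the highest contrastive score."""
    scores = contrastive_scores(p_expert, p_amateur, alpha)
    return max(scores, key=scores.get)

# Hypothetical distributions: the amateur strongly favors the generic "on".
p_expert = {"on": 0.50, "beside": 0.30, "in": 0.20}
p_amateur = {"on": 0.80, "in": 0.15, "beside": 0.05}

print(contrastive_pick(p_expert, p_amateur, alpha=0.0))
print(contrastive_pick(p_expert, p_amateur, alpha=0.5))
```

With α = 0 the pick matches expert greedy ("on"); at α = 0.5 the amateur's low probability for "beside" lifts its score above the others.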

⚡ Speculative Decoding

A fast draft model proposes γ tokens up front (all γ are drafted before any of them is checked); the slow target model then verifies all γ in one batched forward pass. Each drafted token is accepted with probability min(1, P_target/P_draft); on the first rejection, a correction token is sampled from the residual distribution and the remaining drafted tokens are discarded. The output distribution equals plain target sampling.

γ (draft length, # of tokens proposed per Step click) = 4

Draft

Target verify

Result

Steps

Tokens generated

Avg tokens / step

—

Accepted sequence

accept x̃_t if U[0,1] < min(1, P_target(x̃_t) / P_draft(x̃_t))
on reject: sample from normalize(max(0, P_target − P_draft))

💡 Avg tokens/step > 1 means the target runs once per step yet emits more than one token. This is a net speedup as long as drafting is cheap compared with a target forward pass.
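The accept/reject rule above can be sketched for a single verification step. The function name speculative_step and all distributions are hypothetical; in a real system the per-position probabilities would come from the draft and target models. This sketch also omits the extra token that the standard algorithm samples from the target when every drafted token is accepted.

```python
import random

def speculative_step(draft_tokens, p_draft, p_target, rng):
    """Verify drafted tokens in order; stop at the first rejection.

    p_draft[i] and p_target[i] map token -> probability at drafted position i.
    Returns the tokens emitted by this step.
    """
    emitted = []
    for i, tok in enumerate(draft_tokens):
        ratio = p_target[i][tok] / p_draft[i][tok]
        if rng.random() < min(1.0, ratio):
            emitted.append(tok)  # accepted: keep the drafted token
            continue
        # Rejected: sample a correction from normalize(max(0, P_target - P_draft)).
        residual = {
            t: max(0.0, p_target[i][t] - p_draft[i].get(t, 0.0))
            for t in p_target[i]
        }
        total = sum(residual.values())
        choices = list(residual)
        weights = [residual[t] / total for t in choices]
        emitted.append(rng.choices(choices, weights=weights, k=1)[0])
        break  # discard the remaining drafted tokens
    return emitted

# Hypothetical distributions for a draft of length 2 (invented numbers).
draft = [{"on": 0.7, "in": 0.3}, {"the": 0.6, "a": 0.4}]
target = [{"on": 0.5, "in": 0.5}, {"the": 0.8, "a": 0.2}]
rng = random.Random(0)
result = speculative_step(["on", "the"], draft, target, rng)
```

When the draft and target agree exactly, every drafted token is accepted; when the target assigns zero probability to a drafted token, the step emits a single correction token instead.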