🎯 Greedy Decoding
At each position, pick the single token with the highest probability. After picking, advance to the next position and re-render its distribution. Deterministic and fast.
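A minimal sketch of the rule above, assuming the model's output is available as a list of per-position {token: probability} dicts. The name greedy_decode and the toy numbers are illustrative and not taken from the demo.

```python
def greedy_decode(distributions):
    """Pick the highest-probability token at each position.

    `distributions` is a hypothetical list of {token: probability} dicts,
    one per position, standing in for the model's per-step output.
    """
    output = []
    for dist in distributions:
        # argmax over tokens; ties resolve to the first key encountered
        output.append(max(dist, key=dist.get))
    return output


# Made-up distributions for two positions.
steps = [
    {"on": 0.60, "in": 0.10, "by": 0.30},
    {"the": 0.70, "a": 0.30},
]
print(greedy_decode(steps))  # ['on', 'the']
```

Because there is no randomness, the same distributions always yield the same output.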
🌡 Temperature Sampling
Scale logits by 1/T before softmax: T < 1 sharpens the distribution, T > 1 flattens it. Each Generate Next Token click samples from the current position and advances to the next one; the bar chart re-renders for the new position.
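The scaling step can be sketched as follows. This is an illustration under assumptions: the function names, the token list, and the logits are made up, and a real model would supply its own logits.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Divide logits by T, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_with_temperature(tokens, logits, temperature, rng=random):
    """Draw one token from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


# Hypothetical logits for three candidate tokens.
tokens = ["on", "in", "by"]
logits = [2.0, 0.2, 1.0]
cold = softmax_with_temperature(logits, 0.5)  # more mass on the top token
hot = softmax_with_temperature(logits, 2.0)   # closer to uniform
print(cold, hot)
print(sample_with_temperature(tokens, logits, 1.0))
```

As T approaches 0 the sampled token approaches the greedy choice. T must be strictly positive in this sketch.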
Top-k Sampling
Keep only the k highest-probability tokens, renormalize, then sample. After sampling, advance to the next position.
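A sketch of the filter, renormalize, and sample steps, assuming the distribution is a {token: probability} dict. The names top_k_filter and sample_top_k and the example numbers are hypothetical.

```python
import random


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize them to sum to 1."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def sample_top_k(probs, k, rng=random):
    """Sample one token from the renormalized top-k distribution."""
    filtered = top_k_filter(probs, k)
    toks = list(filtered)
    return rng.choices(toks, weights=[filtered[t] for t in toks], k=1)[0]


# Made-up distribution over four tokens.
dist = {"on": 0.5, "in": 0.3, "by": 0.15, "at": 0.05}
print(top_k_filter(dist, 2))  # only "on" and "in" survive
print(sample_top_k(dist, 2))
```

With k = 1 this reduces to greedy decoding. With k equal to the vocabulary size it is plain sampling.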
Top-p (Nucleus) Sampling
Keep the smallest set of tokens whose cumulative probability is at least p, renormalize, then sample. The nucleus size adapts to the shape of the distribution.
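The nucleus construction can be sketched like this, assuming a {token: probability} dict as input. The name top_p_filter and both example distributions are made up for illustration.

```python
def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:  # nucleus is complete
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# A peaked distribution yields a small nucleus, a flat one a larger nucleus.
peaked = {"on": 0.9, "in": 0.05, "by": 0.05}
flat = {"on": 0.4, "in": 0.3, "by": 0.2, "at": 0.1}
print(len(top_p_filter(peaked, 0.8)))  # 1
print(len(top_p_filter(flat, 0.8)))    # 3
```

Sampling then proceeds from the returned dict exactly as in top-k sampling.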
🌳 Beam Search
Keep the B highest-scoring partial sequences at each step. Score is the sum of log-probabilities along the path.
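A compact sketch of that loop, under assumptions: next_log_probs is a hypothetical callback returning {token: log-probability} for a given prefix, and toy_model is a made-up, prefix-independent stand-in for a real model.

```python
import math


def beam_search(next_log_probs, beam_width, steps):
    """Return the beam_width best (sequence, summed log-prob) pairs."""
    beams = [((), 0.0)]  # start from the empty sequence
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            # extend every surviving beam by every possible next token
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # keep only the B highest-scoring partial sequences
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(prefix):
    # Hypothetical distribution that ignores the prefix entirely.
    return {"on": math.log(0.6), "in": math.log(0.3), "by": math.log(0.1)}


for seq, score in beam_search(toy_model, beam_width=2, steps=2):
    print(seq, round(score, 3))
```

With a prefix-independent model like this one, the best beam is just the top token repeated at every position.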
on the soft …) and only diverge late.
Why? At step 1 the top token (on) leads the runner-up (in) by ~1.8 log-units, and every subsequent position uses the same distribution regardless of prefix (the demo is unconditioned).
In real LLMs, P(the | "...sat on") ≠ P(the | "...sat in"), which lets beams genuinely diverge — but beam search still collapses in practice.
That's why modern LLMs prefer sampling (temperature / top-p) over beam search and, when beam search is needed, apply length normalization (score / length^α) or diverse beam search (penalize beams that share prefixes).
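The length-normalization formula can be illustrated with a small sketch. The function name, the default α of 0.7, and the two example scores are assumptions chosen for illustration, not values from the demo.

```python
def length_normalized_score(sum_log_prob, length, alpha=0.7):
    """Divide a summed log-probability by length**alpha.

    alpha = 0 leaves the raw score unchanged; alpha = 1 gives the
    average log-probability per token.
    """
    return sum_log_prob / (length ** alpha)


# Made-up hypotheses: a short one and a longer one with a lower raw score.
short_raw, short_len = -2.0, 2
long_raw, long_len = -3.0, 6

print(short_raw > long_raw)  # True: raw scores favor the short sequence
print(length_normalized_score(long_raw, long_len)
      > length_normalized_score(short_raw, short_len))  # True after normalizing
```

Raw summed log-probabilities always shrink as a sequence grows, so unnormalized beam search is biased toward short outputs. Dividing by length^α offsets that bias.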
⚖ Contrastive Decoding
Pick tokens that the expert model prefers far more than a small amateur model does. This penalizes tokens the amateur likes (generic/high-frequency). Each Generate Next Token advances one position.
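One way to sketch this scoring is log P_expert minus log P_amateur. The plausibility cutoff (keep only tokens whose expert probability is at least alpha times the expert's maximum) is an assumption added here; it is a common guard in contrastive decoding and is not described in the text above. All names and numbers are illustrative.

```python
import math


def contrastive_pick(expert, amateur, alpha=0.1):
    """Return the token maximizing log P_expert - log P_amateur.

    Assumes every expert token also has a positive amateur probability.
    """
    cutoff = alpha * max(expert.values())  # plausibility threshold (assumed)
    best_tok, best_score = None, -math.inf
    for tok, p_exp in expert.items():
        if p_exp < cutoff:
            continue  # skip tokens the expert itself finds implausible
        score = math.log(p_exp) - math.log(amateur[tok])
        if score > best_score:
            best_tok, best_score = tok, score
    return best_tok


# Made-up distributions: the amateur over-weights the generic token "the".
expert = {"the": 0.5, "mat": 0.3, "a": 0.199, "zzz": 0.001}
amateur = {"the": 0.7, "mat": 0.05, "a": 0.2499, "zzz": 0.0001}
print(contrastive_pick(expert, amateur))  # mat
```

Without the cutoff (alpha = 0), the rare token "zzz" would win in this example, because its probability ratio is large even though the expert considers it very unlikely.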
⚡ Speculative Decoding
A fast draft model proposes γ tokens up front (committing all γ regardless of whether they'll be accepted); the slow target model then verifies all γ in one batched forward pass. Accept each drafted token with probability min(1, P_target / P_draft); on the first reject, sample a correction and discard the rest. The output distribution equals plain target sampling.
on reject: sample from normalize(max(0, P_target − P_draft))
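The accept/reject rule can be sketched as below. This assumes the per-position draft and target distributions are already available as {token: probability} dicts. The names verify_draft and residual_distribution are hypothetical. The sketch omits the extra token that some implementations sample from the target when every drafted token is accepted.

```python
import random


def residual_distribution(p_target, p_draft):
    """normalize(max(0, P_target - P_draft)) over the target's tokens."""
    raw = {t: max(0.0, p_target[t] - p_draft.get(t, 0.0)) for t in p_target}
    # A rejection implies P_target < P_draft for the drafted token, so some
    # other token has P_target > P_draft and the total is positive.
    total = sum(raw.values())
    return {t: v / total for t, v in raw.items()}


def verify_draft(draft_tokens, target_dists, draft_dists, rng=random):
    """Accept drafted tokens left to right; stop at the first rejection."""
    output = []
    for tok, p_t, p_d in zip(draft_tokens, target_dists, draft_dists):
        accept_prob = min(1.0, p_t.get(tok, 0.0) / p_d[tok])
        if rng.random() < accept_prob:
            output.append(tok)
            continue
        # First reject: sample a correction, discard the remaining drafts.
        residual = residual_distribution(p_t, p_d)
        toks = list(residual)
        weights = [residual[t] for t in toks]
        output.append(rng.choices(toks, weights=weights, k=1)[0])
        break
    return output


# Made-up case: draft and target agree exactly, so every token is accepted.
same = {"on": 0.6, "in": 0.4}
print(verify_draft(["on", "in"], [same, same], [same, same]))
```

When the two models agree closely, most drafted tokens are accepted and several tokens are produced per target forward pass.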