Deep Dive: Decoding Math & The Combinatorial Explosion
1 The Combinatorial Explosion
If an LLM vocabulary has 50,000 tokens, a 5-token sentence has $50{,}000^5$ possible combinations. Evaluating them all is computationally intractable. Beam Search tames this by generating every next-token candidate for each active path, but then ruthlessly pruning everything except the top $k$.
The "Greedy Trap": Greedy Search ($k=1$) falls into traps by picking a word that looks good immediately (like 'a') but leads to terrible continuations. Beam Search survives this by keeping 'runner-up' paths alive.
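The pruning loop and the greedy trap can be seen in a minimal sketch. The toy "model" below (`TOY`, `next_logprobs`) is hypothetical, invented for illustration: the token 'a' looks best immediately, but 'the' leads to a stronger full phrase, so greedy ($k=1$) loses while a beam of 2 wins.

```python
import math

def beam_search(next_logprobs, start, steps, k=3):
    """Toy beam search: expand every active path, keep only the top-k.

    `next_logprobs(path)` is a stub standing in for an LLM: it returns
    {token: log_prob} for the next token given the path so far.
    """
    beams = [([start], 0.0)]  # (token path, cumulative log prob)
    for _ in range(steps):
        candidates = []
        for path, score in beams:
            for tok, lp in next_logprobs(path).items():
                candidates.append((path + [tok], score + lp))
        # Ruthless pruning: keep only the k highest-scoring paths
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]
    return beams

# Hand-made probabilities illustrating the greedy trap:
# 'a' (0.6) beats 'the' (0.4) at step 1, but 'the cat' (0.4 * 0.9 = 0.36)
# beats any continuation of 'a' (at most 0.6 * 0.1 = 0.06) overall.
TOY = {
    ("<s>",): {"a": math.log(0.6), "the": math.log(0.4)},
    ("<s>", "a"): {"cat": math.log(0.1), "dog": math.log(0.1)},
    ("<s>", "the"): {"cat": math.log(0.9), "dog": math.log(0.05)},
}

best_path, best_score = beam_search(lambda p: TOY[tuple(p)], "<s>", steps=2, k=2)[0]
```

With $k=2$ the runner-up 'the' survives the first pruning round and ends up winning; with $k=1$ it would have been discarded immediately.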
2 Log Probabilities & Length Penalty
Probabilities are decimals between 0 and 1. Multiplying many tiny decimals causes floating-point underflow (the result rounds to 0). Instead, LLMs sum Log Probabilities ($\log P$), since $\log(a \cdot b) = \log a + \log b$. Because probabilities are $< 1$, Log Probs are always negative. Closer to 0 is better.
The Length Problem: Because we keep adding negative numbers, longer sentences will always have lower scores. To fix this, we apply Length Normalization (Score $\div$ Number of tokens) so the model doesn't unfairly favor artificially short sentences.
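Both ideas fit in a few lines. This is a minimal sketch (not any particular library's scoring function); the example probability lists are made up to show that the raw sum favors the short sequence while the per-token score favors the longer, equally confident one.

```python
import math

def length_normalized_score(token_probs):
    """Sum log probs (avoiding underflow), then divide by token count
    so longer sequences aren't unfairly penalized."""
    log_score = sum(math.log(p) for p in token_probs)
    return log_score / len(token_probs)

short = [0.5, 0.5]            # 2 tokens
longer = [0.5, 0.5, 0.9, 0.9]  # 4 tokens, model stays confident

raw_short = sum(math.log(p) for p in short)    # ~ -1.386
raw_longer = sum(math.log(p) for p in longer)  # ~ -1.597: raw sum prefers short
norm_short = length_normalized_score(short)    # ~ -0.693
norm_longer = length_normalized_score(longer)  # ~ -0.399: per-token prefers longer
```

Every extra token adds a negative term, so the raw sum always drifts downward; dividing by length compares sequences on average per-token quality instead.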
3 Temperature Sampling (Stochastic Decoding)
Instead of strictly picking the highest score, Sampling rolls a weighted die. This is how ChatGPT generates creative, varied responses. The Temperature ($T$) divides the model's raw scores (logits) before they are converted to probabilities, reshaping the distribution before the roll:
- ❄️ Low Temp ($T < 1$) Exaggerates differences. The most likely word becomes almost guaranteed (approaching Greedy behavior).
- 🔥 High Temp ($T > 1$) Flattens the distribution. Crazy, less likely words have a much higher chance of being picked.
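The two bullets above can be verified numerically. This is a minimal sketch of temperature scaling (divide logits by $T$, then softmax, then roll the weighted die); the logit values are invented for illustration.

```python
import math
import random

def apply_temperature(logits, T):
    """Divide logits by T, then softmax. T < 1 sharpens the
    distribution; T > 1 flattens it."""
    scaled = [l / T for l in logits]
    m = max(scaled)                            # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng=random):
    """Roll the weighted die: pick an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                # made-up raw scores for 3 tokens
cold = apply_temperature(logits, 0.5)   # sharper: top token dominates
hot = apply_temperature(logits, 2.0)    # flatter: unlikely tokens gain mass
token = sample(hot)
```

At $T \to 0$ this converges to Greedy Search; at very high $T$ it approaches picking uniformly at random.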