Demystifying Softmax

Visualizing Logits vs. Softmax vs. Hard Max

Temperature & The Flattening Effect

Temperature ($T$) controls how sharp or flat the output distribution is. By dividing logits by $T$ before exponentiation, we rescale the relative differences between scores: small $T$ amplifies them, large $T$ shrinks them.

  • Sharpening ($T \to 0$): Small differences in logits are amplified. The winner dominates, approaching Hard Max.
  • Standard ($T = 1$): The original relative differences are preserved through $e^z$.
  • Flattening ($T \to \infty$): Logits are "crushed" toward zero. The distribution becomes uniform ($1/K$), meaning the model becomes uncertain and "creative."
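The three regimes above can be seen directly by sweeping $T$ over a fixed set of logits. A minimal NumPy sketch (the `softmax` helper and the example logits are illustrative, not from any particular library):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: sigma(z, T)_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # subtract the max for numerical stability (result is unchanged)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 1.0, 0.0])

for T in (0.1, 1.0, 100.0):
    print(f"T={T:>5}: {np.round(softmax(logits, T), 3)}")
```

At $T = 0.1$ the winner takes essentially all the mass (near hard max); at $T = 1$ you get the standard softmax; at $T = 100$ the three probabilities are nearly equal (near $1/K$).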

The Math

$$\sigma(z, T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^K e^{z_j/T}}$$

As $T \to \infty$, $z_i/T \to 0$, so $e^{z_i/T} \to 1$. Every class then gets a probability of $1/K$.
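Both limits are easy to verify numerically. A short sketch (same illustrative `softmax` helper as above; the extreme temperatures $10^{-3}$ and $10^{6}$ are arbitrary stand-ins for the limits):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([5.0, 1.0, 0.0])

# T -> 0: softmax approaches the one-hot "hard max" vector
hard = np.eye(len(z))[z.argmax()]
print(np.allclose(softmax(z, T=1e-3), hard))       # True

# T -> infinity: every e^{z_i/T} -> e^0 = 1, so each probability -> 1/K
print(np.allclose(softmax(z, T=1e6), 1 / len(z)))  # True
```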

Interactive Lab

[Interactive demo: sliders for Temperature ($T$ = 1.0) and three logits — CAT 5.0, DOG 1.0, BIRD 0.0 — feed three linked panels: Exponentials ($e^{z/T}$), Softmax Probabilities, and Hard Max (Winner). Legend: broad bars = logits, mid = softmax, thin = hard max.]
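Without the interactive widget, the lab's default state can be reproduced in a few lines. A sketch assuming the default settings shown above ($T = 1$, logits 5.0 / 1.0 / 0.0):

```python
import numpy as np

T = 1.0
labels = ["CAT", "DOG", "BIRD"]
logits = np.array([5.0, 1.0, 0.0])

exps = np.exp(logits / T)                       # Exponentials e^{z/T}
probs = exps / exps.sum()                       # Softmax probabilities
hard = (logits == logits.max()).astype(float)   # Hard max: all mass on the winner

for name, e, p, h in zip(labels, exps, probs, hard):
    print(f"{name:>4}: exp={e:8.3f}  softmax={p:.3f}  hardmax={h:.0f}")
```

CAT's logit lead of 4 over DOG becomes a factor of $e^4 \approx 54.6$ after exponentiation, which is why CAT ends up with roughly 98% of the probability mass even though the logits look only moderately far apart.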