Demystifying Softmax

Visualizing Logits vs. Softmax vs. Hard Max

Temperature & The Flattening Effect

Temperature ($T$) controls how sharp or flat the output distribution is. By dividing logits by $T$ before exponentiation, we rescale the relative differences between scores: small $T$ amplifies them, large $T$ shrinks them.

  • Sharpening ($T \to 0$): Small differences in logits are amplified. The winner dominates, approaching Hard Max.
  • Standard ($T = 1$): The original relative differences are preserved through $e^z$.
  • Flattening ($T \to \infty$): Logits are "crushed" toward zero. The distribution becomes uniform ($1/K$), meaning the model becomes uncertain and "creative."
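The three regimes above can be seen directly by sweeping $T$ over a fixed set of logits. A minimal NumPy sketch (the `softmax` helper and the example logits are illustrative, not from any particular library):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax: sigma(z, T)_i = exp(z_i/T) / sum_j exp(z_j/T)."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # subtract the max for numerical stability (result is unchanged)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([5.0, 1.0, 0.0])

for T in (0.1, 1.0, 100.0):
    print(f"T={T:>5}: {np.round(softmax(logits, T), 3)}")
```

At $T = 0.1$ the winner takes essentially all the mass (near hard max); at $T = 1$ you get the standard softmax; at $T = 100$ the three probabilities are nearly equal (near $1/K$).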

The Math

$$\sigma(z, T)_i = \frac{e^{z_i/T}}{\sum_{j=1}^K e^{z_j/T}}$$

As $T \to \infty$, $z_i/T \to 0$, so $e^{z_i/T} \to 1$. Every class then gets a probability of $1/K$.
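Both limits are easy to verify numerically. A short sketch (same illustrative `softmax` helper as above; the extreme temperatures $10^{-3}$ and $10^{6}$ are arbitrary stand-ins for the limits):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.array([5.0, 1.0, 0.0])

# T -> 0: softmax approaches the one-hot "hard max" vector
hard = np.eye(len(z))[z.argmax()]
print(np.allclose(softmax(z, T=1e-3), hard))       # True

# T -> infinity: every e^{z_i/T} -> e^0 = 1, so each probability -> 1/K
print(np.allclose(softmax(z, T=1e6), 1 / len(z)))  # True
```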

Interactive Lab

[Interactive demo: sliders for Temperature ($T$ = 1.0) and three logits — CAT 5.0, DOG 1.0, BIRD 0.0 — feed three linked panels: Exponentials ($e^{z/T}$), Softmax Probabilities, and Hard Max (Winner). Legend: broad bars = logits, mid = softmax, thin = hard max.]
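Without the interactive widget, the lab's default state can be reproduced in a few lines. A sketch assuming the default settings shown above ($T = 1$, logits 5.0 / 1.0 / 0.0):

```python
import numpy as np

T = 1.0
labels = ["CAT", "DOG", "BIRD"]
logits = np.array([5.0, 1.0, 0.0])

exps = np.exp(logits / T)                       # Exponentials e^{z/T}
probs = exps / exps.sum()                       # Softmax probabilities
hard = (logits == logits.max()).astype(float)   # Hard max: all mass on the winner

for name, e, p, h in zip(labels, exps, probs, hard):
    print(f"{name:>4}: exp={e:8.3f}  softmax={p:.3f}  hardmax={h:.0f}")
```

CAT's logit lead of 4 over DOG becomes a factor of $e^4 \approx 54.6$ after exponentiation, which is why CAT ends up with roughly 98% of the probability mass even though the logits look only moderately far apart.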