Loss Functions Explorer

From loss function to weight update: trace the chain rule derivation of ∂L/∂w step by step, then compare every major loss function — MSE, MAE, 0/1 loss, Hinge, Logistic (BCE), and Categorical Cross-Entropy — side by side.

🔗 Chain Rule: How to Compute ∂L/∂w

∂L/∂wⱼ cannot be computed in one shot: the chain rule decomposes it into two simpler derivatives, ∂L/∂ŷ and ∂ŷ/∂wⱼ. Click ▶ Next to walk through each step of the derivation.

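Computation graph (forward pass, left to right):
weight wⱼ → prediction ŷ = Σⱼ wⱼxⱼ + b → loss (MSE) L = (1/N) Σᵢ (ŷᵢ − yᵢ)²

Local derivatives (backpropagation direction, right to left), shown for a single weight w and the feature value xᵢ of sample i:
∂L/∂ŷᵢ = (2/N)(ŷᵢ − yᵢ)
∂ŷᵢ/∂w = xᵢ (per sample)

Chain rule result: ∂L/∂w = (2/N) Σᵢ (ŷᵢ − yᵢ) · xᵢ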

Live numbers — N = 20, y = 2x + noise
w (slider) 3.00
mean(ŷ − y)
∂L/∂ŷᵢ = (2/N)(ŷᵢ−yᵢ) per sample
∂ŷ/∂w = xᵢ (each sample)
∂L/∂w = (2/N)Σ(ŷᵢ−yᵢ)·xᵢ

Key insight

∂ŷ/∂w = xᵢ — for each sample i, the partial derivative of the prediction with respect to w is simply the feature value xᵢ. The chain rule multiplies this with the error signal: "how wrong is the prediction across all samples?" × "what feature value drives w?" — giving ∂L/∂w = (2/N)Σ(ŷᵢ−yᵢ)·xᵢ.
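The derivation can be checked numerically. The sketch below is illustrative only: it assumes a single-feature model ŷ = w·x + b, synthetic data of the form y = 2x + noise with N = 20, and helper names (`predict`, `mse`, `grad_w_chain_rule`) invented for this example. It multiplies the two local derivatives and compares the result with a central finite difference.

```python
import random

def predict(w, b, xs):
    """Forward pass: y_hat_i = w * x_i + b for every sample."""
    return [w * x + b for x in xs]

def mse(w, b, xs, ys):
    """L = (1/N) * sum_i (y_hat_i - y_i)^2."""
    n = len(xs)
    return sum((yh - y) ** 2 for yh, y in zip(predict(w, b, xs), ys)) / n

def grad_w_chain_rule(w, b, xs, ys):
    """dL/dw = sum_i (dL/dy_hat_i) * (dy_hat_i/dw)."""
    n = len(xs)
    y_hat = predict(w, b, xs)
    dL_dyhat = [(2.0 / n) * (yh - y) for yh, y in zip(y_hat, ys)]  # error signal
    dyhat_dw = xs                                                  # feature value
    return sum(g * x for g, x in zip(dL_dyhat, dyhat_dw))

# Assumed toy data: y = 2x + Gaussian noise, N = 20.
random.seed(0)
N = 20
xs = [random.uniform(-1.0, 1.0) for _ in range(N)]
ys = [2.0 * x + random.gauss(0.0, 0.1) for x in xs]

w, b = 3.0, 0.0
analytic = grad_w_chain_rule(w, b, xs, ys)

# Central finite difference as an independent check.
eps = 1e-6
numeric = (mse(w + eps, b, xs, ys) - mse(w - eps, b, xs, ys)) / (2 * eps)

print(f"chain rule: {analytic:.6f}   finite difference: {numeric:.6f}")
```

With w = 3 above the true slope of 2, the gradient comes out positive, so a gradient descent step would reduce w.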

📊 Loss Function Zoo: Shapes, Properties & Comparison

Every major loss function from the lecture — plotted interactively with formulas alongside. Use the sliders and draggable points to build intuition for the differences.

Regression Losses: MSE vs MAE

Drag the point left/right to feel how MSE and MAE respond differently to outliers. At e = 3, the squared error is 9 while the absolute error is only 3; at e = 10 the gap widens to 100 versus 10.

MSE — Squared Error

L = (1/N) Σ(ŷ − y)²

∂L/∂ŷ = (2/N)(ŷ − y)

  • Smooth & differentiable everywhere
  • Outliers dominate — quadratic growth

MAE — Absolute Error

L = (1/N) Σ|ŷ − y|

∂L/∂ŷ = sign(ŷ − y) / N

  • Robust to outliers — linear growth
  • Non-differentiable at e = 0

Current error e = ŷ − y

e: 2.000
MSE loss = e²: 4.000
∂MSE/∂ŷ: +0.2000
MAE loss = |e|: 2.000
∂MAE/∂ŷ: +0.0500
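The two regression losses and their per-sample gradients can be sketched in a few lines. This is a minimal illustration, not the page's own implementation: the function names are hypothetical, N = 20 is assumed to match the live-numbers panel, and the MAE subgradient at e = 0 is chosen to be 0.

```python
def sign(e):
    """Return -1, 0, or +1 depending on the sign of e."""
    return (e > 0) - (e < 0)

def mse_terms(e, n):
    """Squared error e^2 and gradient (2/N) * e for one sample."""
    return e ** 2, (2.0 / n) * e

def mae_terms(e, n):
    """Absolute error |e| and subgradient sign(e) / N (0 chosen at e = 0)."""
    return abs(e), sign(e) / n

N = 20  # assumed sample count
for e in (0.5, 2.0, 3.0, 10.0):
    sq, g_sq = mse_terms(e, N)
    ab, g_ab = mae_terms(e, N)
    print(f"e={e:5.1f}  e^2={sq:7.2f}  dMSE={g_sq:+.4f}  |e|={ab:5.2f}  dMAE={g_ab:+.4f}")
```

The MSE gradient grows in proportion to the error, while the MAE gradient has constant magnitude 1/N, which is why a single large outlier pulls an MSE fit much harder.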

Classification Losses: 0/1, Hinge, Logistic (BCE)

All three are plotted against the margin z = y · ŷ, where y ∈ {−1, +1}. When z > 0 the prediction is on the correct side; z > 1 means "confident and correct".

0/1 Loss — step function (non-differentiable)

[Plot: 0/1 loss against the margin z = y · ŷ, for z from −2 to 3. Loss = 1 where the prediction is wrong, loss = 0 where it is correct; z = 1 is marked.]
Why 0/1 loss cannot be used with gradient descent: the gradient is 0 almost everywhere (flat) and undefined at z = 0. No gradient means no parameter update. We need a surrogate loss that is convex, differentiable (or at least subdifferentiable), and an upper bound on the 0/1 loss.

Loss values at current z

0/1: 0
Hinge max(0, 1 − z): 0.500
Logistic log(1 + e^(−z)): 0.474

Hinge Loss (SVM)

L = max(0, 1 − y · ŷ)

∂L/∂ŷ = −y if y·ŷ < 1
∂L/∂ŷ = 0 if y·ŷ ≥ 1

Gradient = 0 when z ≥ 1 (correct and on or beyond the margin). The loss is piecewise linear, so updates are sparse: only points with z < 1 contribute, which keeps SVM training efficient.
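As a small sketch of the piecewise rule (function names are hypothetical, and the subgradient at the kink y·ŷ = 1 is taken to be 0, as in the formula above):

```python
def hinge_loss(y, y_hat):
    """max(0, 1 - y * y_hat) for a label y in {-1, +1} and raw score y_hat."""
    return max(0.0, 1.0 - y * y_hat)

def hinge_subgradient(y, y_hat):
    """Subgradient with respect to y_hat: -y inside the margin, 0 otherwise."""
    return -float(y) if y * y_hat < 1.0 else 0.0

# Wrong side, correct but inside the margin, beyond the margin, wrong side (y = -1).
for y, y_hat in [(+1, -0.5), (+1, 0.5), (+1, 1.5), (-1, 0.5)]:
    loss = hinge_loss(y, y_hat)
    grad = hinge_subgradient(y, y_hat)
    print(f"y={y:+d}  y_hat={y_hat:+.1f}  loss={loss:.2f}  dL/dy_hat={grad:+.0f}")
```

Only the third point (z = 1.5) yields a zero subgradient, so it would not contribute to a parameter update.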

Logistic Loss / BCE

L = log(1 + e^(−z)) (margin form, z = y·ŷ, y ∈ {−1,+1})

BCE form: −[y′ log p + (1−y′)log(1−p)]

(y′ ∈ {0,1}, p = σ(ŷ) — different label encoding)

∂L/∂ŷ = −y(1 − σ(y·ŷ)) (margin form)

Always smooth (C∞), gradient never exactly 0. "Soft margin" — even confidently correct points receive a small push.
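The margin form and the BCE form describe the same loss under the two label encodings. The sketch below illustrates that with hypothetical helper names; it maps y ∈ {−1, +1} to y′ = (y + 1)/2, uses natural logarithms, and rewrites log(1 + e^(−z)) in an overflow-safe form, which is an implementation choice made here rather than something the text prescribes.

```python
import math

def sigmoid(t):
    """Logistic function, written to avoid overflow for large |t|."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def logistic_loss_margin(y, y_hat):
    """log(1 + e^(-z)) with z = y * y_hat, y in {-1, +1}."""
    z = y * y_hat
    # Identity: log(1 + e^(-z)) = max(-z, 0) + log(1 + e^(-|z|))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def bce_loss(y01, y_hat):
    """-[y' log p + (1 - y') log(1 - p)] with p = sigmoid(y_hat), y' in {0, 1}."""
    p = sigmoid(y_hat)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))

def logistic_grad(y, y_hat):
    """dL/dy_hat = -y * (1 - sigmoid(y * y_hat)) in the margin form."""
    return -y * (1.0 - sigmoid(y * y_hat))

for y in (-1, +1):
    for y_hat in (-2.0, 0.5, 3.0):
        y01 = (y + 1) // 2  # map {-1, +1} to {0, 1}
        print(f"y={y:+d} y_hat={y_hat:+.1f} "
              f"margin={logistic_loss_margin(y, y_hat):.4f} "
              f"bce={bce_loss(y01, y_hat):.4f} "
              f"grad={logistic_grad(y, y_hat):+.4f}")
```

The gradient shrinks toward 0 as z grows but never reaches it, which is the "small push" on confidently correct points.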

Surrogate losses — the key idea

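Hinge and Logistic are convex surrogates for the 0/1 loss. Hinge is piecewise-linear (subdifferentiable, not differentiable at z = 1) and upper-bounds the 0/1 loss for every z. Logistic is smooth everywhere (C∞) and upper-bounds the 0/1 loss once it is scaled by 1/log 2 (that is, written with log base 2), because with the natural log it equals log 2 ≈ 0.693 < 1 at z = 0. Since the surrogate sits above the 0/1 loss, driving the average surrogate loss down also pushes down an upper bound on the misclassification rate, so gradient descent on a surrogate loss is a principled proxy for minimizing 0/1 loss.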
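The bound can be checked on a grid of margins. This is a rough numeric sketch with hypothetical function names; it assumes a point exactly on the boundary (z = 0) counts as an error. Note that with the natural logarithm, log(1 + e^(−z)) is only log 2 ≈ 0.693 at z = 0, so the bound holds for the logistic loss only after dividing by log 2 (equivalently, using log base 2).

```python
import math

def zero_one(z):
    # Assumption: z = 0 (on the decision boundary) counts as an error.
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    return max(0.0, 1.0 - z)

def logistic_nat(z):
    """log(1 + e^(-z)) with the natural log, in an overflow-safe form."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

def logistic_base2(z):
    """Same loss with log base 2, which equals exactly 1 at z = 0."""
    return logistic_nat(z) / math.log(2.0)

TOL = 1e-12  # slack for floating-point rounding at z = 0
zs = [k / 10.0 for k in range(-30, 31)]  # margins from -3.0 to 3.0

hinge_ok = all(hinge(z) >= zero_one(z) - TOL for z in zs)
nat_ok = all(logistic_nat(z) >= zero_one(z) - TOL for z in zs)
base2_ok = all(logistic_base2(z) >= zero_one(z) - TOL for z in zs)

print("hinge bounds 0/1:", hinge_ok)
print("natural-log logistic bounds 0/1:", nat_ok)
print("base-2 logistic bounds 0/1:", base2_ok)
```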

Multi-Class: Categorical Cross-Entropy

Adjust the logits for 3 classes. Softmax converts them to probabilities; CCE penalizes the model for low confidence on the correct class.

z₁ (correct) 2.0
z₂ 0.0
z₃ -1.0

Categorical Cross-Entropy

L = −Σᵢ yᵢ · log(pᵢ)

= −log(p_correct) (one-hot target y, softmax probabilities p)

∂L/∂zᵢ = pᵢ − yᵢ

Gradient = predicted probability minus one-hot target. Elegantly simple, and numerically stable when softmax and the log are computed together (log-sum-exp) rather than as separate steps.

Softmax

pᵢ = e^(zᵢ) / Σⱼ e^(zⱼ)

Maps any real-valued logits to a valid probability distribution (all positive, sum = 1).
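Softmax, the cross-entropy loss, and the gradient pᵢ − yᵢ fit together as in the sketch below. The function names are hypothetical, the logits are example values, and subtracting the maximum logit before exponentiating is a common stability trick assumed here rather than taken from the text.

```python
import math

def softmax(logits):
    """p_i = e^(z_i) / sum_j e^(z_j), shifted by max(z) to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cce(logits, target_index):
    """-log(p_correct), computed as log-sum-exp(z) - z_correct."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target_index]

def cce_grad(logits, target_index):
    """dL/dz_i = p_i - y_i for a one-hot target y."""
    p = softmax(logits)
    return [pi - (1.0 if i == target_index else 0.0) for i, pi in enumerate(p)]

logits = [2.0, 0.0, -1.0]  # example logits; class 0 is the correct class
probs = softmax(logits)
loss = cce(logits, 0)
grad = cce_grad(logits, 0)

print("probs:", [round(p, 3) for p in probs])
print("loss :", round(loss, 3))
print("grad :", [round(g, 3) for g in grad])
```

The gradient entries sum to zero: probability mass is pushed toward the correct class and away from the others in equal total amount.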

Current values

p₁ (correct): 0.844
p₂: 0.114
p₃: 0.042
CCE Loss: 0.170

Note:

−log(p) → ∞ as p → 0 (very wrong = very large loss)

−log(p) → 0 as p → 1 (very confident = small loss) ✓

Softmax + CCE is the standard output layer for multi-class neural networks.

Expected Loss vs Average Loss

Expected Loss (theoretical)

L = E_{(x,y)~P} [ℓ(f(x), y)]

  • Expectation over the true distribution P
  • The ultimate goal — minimize real-world error
  • P is unknown → cannot compute directly
  • We never have access to this during training

Average Loss (empirical / training)

L̂ = (1/N) Σᵢ ℓ(f(xᵢ), yᵢ)

  • Average over the N training examples
  • What gradient descent actually minimizes
  • A tractable proxy for expected loss
  • Can be made arbitrarily small by overfitting

Generalization gap = L − L̂

When the training loss L̂ is low but the loss on held-out test data (an estimate of L) is high, the model has overfit the training sample: it has memorized the training data rather than learning the true pattern. Techniques that help close the gap: L2 / L1 regularization, dropout, data augmentation, early stopping.
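The gap can be made concrete with a toy experiment. Everything in this sketch is an assumption made for illustration: the data distribution (y = 2x + Gaussian noise), the sample sizes, and the two models. A lookup-table "memorizer" reaches zero training loss, while a least-squares line through the origin does not. The expected loss L cannot be computed, so it is estimated with a large held-out sample.

```python
import random

def sample(n, rng):
    """Draw n (x, y) pairs from an assumed distribution: y = 2x + noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + rng.gauss(0.0, 0.3)) for x in xs]

def average_loss(f, data):
    """Empirical squared-error loss: (1/N) * sum_i (f(x_i) - y_i)^2."""
    return sum((f(x) - y) ** 2 for x, y in data) / len(data)

rng = random.Random(0)
train = sample(20, rng)       # the N training examples
held_out = sample(5000, rng)  # large fresh sample, used to estimate L

# Model 1: memorize every training pair; fall back to the mean label.
lookup = dict(train)
mean_y = sum(y for _, y in train) / len(train)

def memorizer(x):
    return lookup.get(x, mean_y)

# Model 2: least-squares line through the origin.
w_fit = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

def line(x):
    return w_fit * x

for name, f in [("memorizer", memorizer), ("line", line)]:
    l_hat = average_loss(f, train)
    l_est = average_loss(f, held_out)
    print(f"{name:9s}  train={l_hat:.3f}  held-out={l_est:.3f}  gap={l_est - l_hat:+.3f}")
```

The memorizer shows the pattern described above: a training loss of zero paired with a large held-out loss. The fitted line has a higher training loss but a much smaller gap.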