Loss Functions Explorer

🔗 Chain Rule: How to Compute ∂L/∂w

∂L/∂w_j cannot be computed in one shot — the chain rule decomposes it into two simpler derivatives. Click ▶ Next to walk through each step of the derivation.

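w_j (weight) → ŷ = Σ_j w_j x_j + b (prediction) → L = (1/N) Σᵢ (ŷᵢ − yᵢ)² (loss, MSE)

Here i indexes samples and j indexes features; the live panel uses a single feature, so w_j is written simply as w.

∂ŷᵢ/∂w = xᵢ (per sample)

∂L/∂ŷᵢ = (2/N)(ŷᵢ − yᵢ)

∂L/∂w = Σᵢ (∂L/∂ŷᵢ)(∂ŷᵢ/∂w) = (2/N) Σᵢ (ŷᵢ − yᵢ) · xᵢ

← backpropagation direction (from L back to w)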

Live numbers — N = 20, y = 2x + noise

w (slider) 3.00

Key insight

∂ŷᵢ/∂w = xᵢ: for each sample i, the partial derivative of the prediction with respect to w is simply the feature value xᵢ. The chain rule multiplies this by that sample's error signal, (2/N)(ŷᵢ − yᵢ), which answers "how wrong is the prediction on sample i?", and then sums the products over all samples, giving ∂L/∂w = (2/N)Σ(ŷᵢ−yᵢ)·xᵢ.
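The derivation can be checked numerically. The following is a minimal sketch, not the page's own code: the data generator, noise level, seed, and function names are assumptions chosen to mirror the panel (N = 20, y = 2x + noise, w = 3). It builds ∂L/∂w from the two local derivatives and compares the result with a central finite difference.

```python
import random

def mse(w, b, xs, ys):
    """Mean squared error of the line y_hat = w*x + b."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n

def mse_grad_w(w, b, xs, ys):
    """dL/dw via the chain rule: sum_i (dL/dy_hat_i) * (dy_hat_i/dw)."""
    n = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        y_hat = w * x + b
        dL_dyhat = (2.0 / n) * (y_hat - y)  # error signal for sample i
        dyhat_dw = x                        # local derivative of the prediction
        total += dL_dyhat * dyhat_dw
    return total

# Hypothetical data mirroring the panel: N = 20, y = 2x + noise.
rng = random.Random(0)
xs = [rng.uniform(-1.0, 1.0) for _ in range(20)]
ys = [2.0 * x + rng.gauss(0.0, 0.1) for x in xs]

w, b = 3.0, 0.0
analytic = mse_grad_w(w, b, xs, ys)

# Central finite difference as an independent check of the formula.
h = 1e-6
numeric = (mse(w + h, b, xs, ys) - mse(w - h, b, xs, ys)) / (2 * h)
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places. Because w = 3 sits above the generating slope of 2, the gradient is expected to be positive, so a gradient-descent step would reduce w.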

📊 Loss Function Zoo: Shapes, Properties & Comparison

Every major loss function from the lecture — plotted interactively with formulas alongside. Use the sliders and draggable points to build intuition for the differences.

Regression Losses: MSE vs MAE

Drag the point left/right to feel how MSE and MAE respond differently to outliers. At e = 3, the squared error is e² = 9 while the absolute error is only |e| = 3, so a single large error weighs three times as heavily under MSE.

e = 2.000

MSE — Squared Error

L = (1/N) Σ(ŷ − y)²

∂L/∂ŷ = (2/N)(ŷ − y)

Smooth & differentiable everywhere
Outliers dominate — quadratic growth

MAE — Absolute Error

L = (1/N) Σ|ŷ − y|

∂L/∂ŷ = sign(ŷ − y) / N

Robust to outliers — linear growth
Non-differentiable at e = 0

Current error e = ŷ − y

e = 2.000

MSE loss = e² = 4.000

∂MSE/∂ŷ = +0.2000

MAE loss = |e| = 2.000

∂MAE/∂ŷ = +0.0500
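The readouts above follow directly from the two formulas. As a rough sketch (the function names are illustrative, and N = 20 is assumed from the "Live numbers" panel), the per-sample loss and gradient contribution can be computed like this:

```python
def mse_terms(e, n):
    """Squared error e^2 and its gradient contribution (2/N) * e."""
    return e ** 2, (2.0 / n) * e

def mae_terms(e, n):
    """Absolute error |e| and a subgradient sign(e)/N (0 is chosen at e == 0)."""
    sign = (e > 0) - (e < 0)
    return abs(e), sign / n

N = 20  # assumed sample count, taken from the "Live numbers" panel
for e in (0.5, 2.0, 3.0):
    sq, d_sq = mse_terms(e, N)
    ab, d_ab = mae_terms(e, N)
    print(f"e={e}: e^2={sq:.3f} dMSE={d_sq:+.4f} |e|={ab:.3f} dMAE={d_ab:+.4f}")
```

The MSE gradient grows in proportion to e, while the MAE gradient keeps the same magnitude for any nonzero error. That constant magnitude is why a single outlier pulls an MAE fit far less than an MSE fit.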

Classification Losses: 0/1, Hinge, Logistic (BCE)

All three plotted in the margin view: z = y · ŷ, where y ∈ {−1, +1}. When z > 0 the prediction is on the correct side; z > 1 means "confident and correct".

0/1 Loss — step function (non-differentiable)

Why 0/1 loss cannot be used with gradient descent: the gradient is 0 almost everywhere (flat), and undefined at z = 0. No gradient → no parameter update. We need a surrogate loss that is convex, differentiable (or at least subdifferentiable, like hinge), and an upper bound on the 0/1 loss.

z = 0.50

Loss values at current z

0/1 loss = 0 (z > 0, so the prediction is on the correct side)

Hinge max(0, 1−z) = 0.500

Logistic log(1+e^−z) = 0.474
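For reference, the three margin-view losses can be evaluated at any z with a few lines. This is a sketch under stated assumptions: the function names are illustrative, the logistic loss uses the natural log, and counting z = 0 as an error in the 0/1 loss is a convention chosen here, not something the page specifies.

```python
import math

def zero_one(z):
    """0/1 loss in the margin view: 1 when z <= 0 (assumed convention at z == 0)."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    """Hinge loss max(0, 1 - z)."""
    return max(0.0, 1.0 - z)

def logistic(z):
    """log(1 + e^-z) with the natural log; log1p keeps precision when e^-z is tiny."""
    return math.log1p(math.exp(-z))

z = 0.5
print(zero_one(z), hinge(z), round(logistic(z), 3))
```

Note that math.exp(-z) overflows for very negative z (roughly below −709), so a production version would branch on the sign of z. The sketch leaves that out for brevity.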

Hinge Loss (SVM)

L = max(0, 1 − y · ŷ)

∂L/∂ŷ = −y if y·ŷ < 1; 0 if y·ŷ ≥ 1

Gradient = 0 when z ≥ 1 (correct and outside the margin). Piecewise linear, so only points with z < 1 trigger updates: sparse and efficient for SVMs.

Logistic Loss / BCE

L = log(1 + e^−z) (margin form, z = y·ŷ, y ∈ {−1,+1})

BCE form: −[y′ log p + (1−y′)log(1−p)]

(y′ ∈ {0,1}, p = σ(ŷ) — different label encoding)

∂L/∂ŷ = −y(1 − σ(y·ŷ)) (margin form)

Always smooth (C∞), gradient never exactly 0. "Soft margin" — even confidently correct points receive a small push.
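The margin form and the BCE form describe the same loss under two label encodings. The sketch below (function names and the test score of 0.5 are illustrative assumptions) evaluates both forms for each label, along with the margin-form gradient −y(1 − σ(y·ŷ)).

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def logistic_margin(y_pm, score):
    """Margin form: log(1 + e^{-z}) with z = y * score and y in {-1, +1}."""
    return math.log1p(math.exp(-y_pm * score))

def bce(y01, score):
    """BCE form with y' in {0, 1} and p = sigmoid(score)."""
    p = sigmoid(score)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))

def logistic_grad(y_pm, score):
    """dL/dscore = -y * (1 - sigmoid(y * score)) for the margin form."""
    return -y_pm * (1.0 - sigmoid(y_pm * score))

score = 0.5
for y_pm in (-1, +1):
    y01 = (y_pm + 1) // 2  # map {-1, +1} to {0, 1}
    print(y_pm, logistic_margin(y_pm, score), bce(y01, score), logistic_grad(y_pm, score))
```

For each label, the first two printed values should match up to rounding. The gradient is nonzero in both cases, which is the "small push" that even correctly classified points receive.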

Surrogate losses — the key idea

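Hinge and logistic losses are convex surrogates for the 0/1 loss. Hinge upper-bounds it directly; logistic does so once rescaled to log base 2, since with the natural log it equals ln 2 ≈ 0.693 at z = 0. Hinge is piecewise-linear (subdifferentiable, not differentiable at z = 1); logistic is smooth everywhere (C∞). Because the surrogate sits above the 0/1 loss, driving it down also drives down an upper bound on the misclassification error, so gradient descent on a surrogate is a principled proxy for minimizing 0/1 loss.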
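The upper-bound property can be spot-checked on a grid of margins. One caveat motivates this sketch: with the natural log, log(1 + e^0) = ln 2 ≈ 0.693 at z = 0, which is below the 0/1 value of 1, so the check uses the base-2 rescaling. The grid range, tolerance, and function names are assumptions made for illustration.

```python
import math

def zero_one(z):
    """0/1 loss, counting z == 0 as an error (assumed convention)."""
    return 1.0 if z <= 0 else 0.0

def hinge(z):
    return max(0.0, 1.0 - z)

def logistic_base2(z):
    """log2(1 + e^-z): the natural-log logistic loss divided by ln 2."""
    return math.log1p(math.exp(-z)) / math.log(2.0)

TOL = 1e-12  # absorbs floating-point rounding where the bound is tight (z = 0)
grid = [i / 100.0 for i in range(-500, 501)]  # z from -5 to 5

hinge_ok = all(hinge(z) >= zero_one(z) - TOL for z in grid)
logistic_ok = all(logistic_base2(z) >= zero_one(z) - TOL for z in grid)
natural_log_at_zero = math.log1p(math.exp(0.0))  # ln 2, below the 0/1 value of 1
print(hinge_ok, logistic_ok, natural_log_at_zero)
```

A grid check is evidence rather than proof. The bound itself follows because both surrogates are decreasing in z and take a value of at least 1 at z = 0.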

Multi-Class: Categorical Cross-Entropy

Adjust the logits for 3 classes. Softmax converts them to probabilities; CCE penalizes the model for low confidence on the correct class.

z₁ (correct) 2.0

z₂ 0.0

z₃ -1.0

Categorical Cross-Entropy

L = −Σᵢ yᵢ · log(pᵢ) (pᵢ = softmax probability of class i, yᵢ = one-hot target)

= −log(p_correct) (one-hot)

∂L/∂zᵢ = pᵢ − yᵢ

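Gradient = predicted probability minus one-hot target. The softmax and log terms cancel into this simple form, and computing the loss directly from the logits (log-sum-exp) avoids taking the log of a very small probability.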

Softmax

pᵢ = e^zᵢ / Σⱼ e^zⱼ

Maps any real-valued logits to a valid probability distribution (all positive, sum = 1).

Current values

p₁ (correct) 0.844

p₂ 0.114

p₃ 0.042

CCE Loss 0.170

Note:

−log(p) → ∞ as p → 0 (correct class given almost no probability = very large loss)

−log(p) → 0 as p → 1 (confident and correct = small loss) ✓

Softmax + CCE is the standard output layer for multi-class neural networks.
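A compact way to see softmax, the loss, and the gradient working together is to compute all three from the slider logits. This is a minimal standard-library sketch; the function names are illustrative, and treating the first class (index 0) as the correct one is an assumption matching the "z₁ (correct)" label.

```python
import math

def softmax(logits):
    """Softmax with the max subtracted first so exp() cannot overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cce_from_logits(logits, correct):
    """-log p_correct computed as logsumexp(logits) - logits[correct]."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[correct]

def cce_grad(logits, correct):
    """dL/dz_i = p_i - y_i with a one-hot target."""
    p = softmax(logits)
    return [p_i - (1.0 if i == correct else 0.0) for i, p_i in enumerate(p)]

logits = [2.0, 0.0, -1.0]  # slider values from the panel; index 0 is the correct class
probs = softmax(logits)
loss = cce_from_logits(logits, 0)
grad = cce_grad(logits, 0)
print([round(p, 3) for p in probs], round(loss, 3), [round(g, 3) for g in grad])
```

Subtracting the maximum logit leaves the probabilities unchanged, because softmax is invariant to adding a constant to every logit. The gradient entries sum to zero: raising the correct logit lowers the loss exactly as much as the other logits' gradients raise it.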

Expected Loss vs Average Loss

Expected Loss (theoretical)

L = E_(x,y)~P[ℓ(f(x), y)]

Expectation over the true distribution P
The ultimate goal — minimize real-world error
P is unknown → cannot compute directly
We never have access to this during training

Average Loss (empirical / training)

L̂ = (1/N) Σᵢ ℓ(f(xᵢ), yᵢ)

Average over the N training examples
What gradient descent actually minimizes
A tractable proxy for expected loss
Can be made arbitrarily small by overfitting

Generalization gap = L − L̂

When the training loss L̂ is low but the test loss (an estimate of L on held-out data) is high, the model has overfit the training sample: it has memorized the training data rather than learned the underlying pattern. Techniques that help close the gap: L2 / L1 regularization, dropout, data augmentation, early stopping.
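The gap can be made concrete with a deliberately extreme model. The sketch below is a toy illustration under assumptions: the data distribution (y = 2x plus Gaussian noise), the sample sizes, the seed, and the 1-nearest-neighbour "memorizer" are all hypothetical choices, not something taken from the lecture. The memorizer drives the average loss on its training sample to zero, while its loss on a large fresh sample, used as a stand-in for the expected loss, stays well above zero.

```python
import random

def sample(rng, n):
    """Draw n points from the assumed toy distribution y = 2x + Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

def memorizer(train_x, train_y):
    """1-nearest-neighbour predictor: returns the label of the closest training x."""
    def f(x):
        j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[j]
    return f

def average_loss(f, xs, ys):
    """Empirical squared-error loss (1/N) * sum (f(x) - y)^2."""
    return sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train_x, train_y = sample(rng, 20)
f = memorizer(train_x, train_y)

train_loss = average_loss(f, train_x, train_y)  # L-hat on the training sample
fresh_x, fresh_y = sample(rng, 5000)            # large fresh sample approximates L
test_loss = average_loss(f, fresh_x, fresh_y)
print(train_loss, test_loss, test_loss - train_loss)
```

The fresh-sample loss is only a Monte Carlo estimate of the expected loss, since the true distribution P is known here solely because the example generates its own data. In practice a held-out test set plays the same role.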