🔗 Chain Rule: How to Compute ∂L/∂w
∂L/∂w cannot be computed in one step. The chain rule decomposes it into two simpler derivatives: ∂L/∂w = Σᵢ (∂L/∂ŷᵢ) · (∂ŷᵢ/∂w). Click ▶ Next to walk through each step of the derivation.
Key insight
∂ŷᵢ/∂w = xᵢ: for a model that is linear in w, the partial derivative of the prediction for sample i with respect to w is simply the feature value xᵢ. The chain rule multiplies this by the error signal ∂L/∂ŷᵢ = (2/N)(ŷᵢ − yᵢ) and sums over samples: "how wrong is the prediction for each sample?" × "what feature value drives w?", giving ∂L/∂w = (2/N) Σᵢ (ŷᵢ − yᵢ)·xᵢ.
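The formula above can be checked numerically. The sketch below assumes a one-feature linear model ŷ = w·x + b with MSE loss; the function names and the toy data are illustrative, not taken from the lecture. It compares the chain-rule gradient against a central finite difference.

```python
def predict(w, b, x):
    # Assumed model: one feature, linear in w.
    return w * x + b


def mse(w, b, xs, ys):
    n = len(xs)
    return sum((predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)) / n


def grad_w(w, b, xs, ys):
    # Chain rule: dL/dy_hat_i = (2/N)(y_hat_i - y_i) and dy_hat_i/dw = x_i.
    n = len(xs)
    return (2.0 / n) * sum((predict(w, b, x) - y) * x for x, y in zip(xs, ys))


def numeric_grad_w(w, b, xs, ys, h=1e-6):
    # Central finite difference, used only as a sanity check.
    return (mse(w + h, b, xs, ys) - mse(w - h, b, xs, ys)) / (2 * h)


xs = [1.0, 2.0, 3.0]  # toy features (hypothetical)
ys = [2.0, 4.0, 7.0]  # toy targets (hypothetical)
analytic = grad_w(0.5, 0.0, xs, ys)
numeric = numeric_grad_w(0.5, 0.0, xs, ys)
print(analytic, numeric)
```

With w = 0.5 and b = 0 the errors are −1.5, −3 and −5.5, so the sum Σ(ŷᵢ − yᵢ)·xᵢ is −24 and the gradient is (2/3)·(−24) = −16. The negative sign says that increasing w lowers the loss.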
📊 Loss Function Zoo: Shapes, Properties & Comparison
Every major loss function from the lecture — plotted interactively with formulas alongside. Use the sliders and draggable points to build intuition for the differences.
Regression Losses: MSE vs MAE
Drag the point left/right to feel how MSE and MAE respond differently to outliers. At e = 3, the squared error is e² = 9 while the absolute error is only |e| = 3, so the same outlier weighs three times more under MSE.
MSE — Squared Error
L = (1/N) Σ(ŷ − y)²
∂L/∂ŷ = (2/N)(ŷ − y)
- Smooth & differentiable everywhere
- Outliers dominate — quadratic growth
MAE — Absolute Error
L = (1/N) Σ|ŷ − y|
∂L/∂ŷ = sign(ŷ − y) / N
- Robust to outliers — linear growth
- Non-differentiable at e = 0
Current error e = ŷ − y
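To make the outlier behaviour concrete, here is a minimal sketch of the two per-sample losses and their derivatives with respect to e. The helper names and the list of errors are hypothetical; at e = 0 the MAE derivative is not defined, and the code returns 0 there as one valid subgradient.

```python
def squared_error(e):
    return e ** 2


def absolute_error(e):
    return abs(e)


def squared_error_grad(e):
    # d(e^2)/de grows with the size of the error.
    return 2.0 * e


def absolute_error_grad(e):
    # sign(e); 0 is chosen at e == 0, where |e| is not differentiable.
    return 1.0 if e > 0 else (-1.0 if e < 0 else 0.0)


errors = [1.0, 1.0, 1.0, 10.0]  # three small errors and one outlier
mse_value = sum(squared_error(e) for e in errors) / len(errors)
mae_value = sum(absolute_error(e) for e in errors) / len(errors)
print(mse_value, mae_value)
```

The single outlier contributes 100 of the 103 units of total squared error, but only 10 of the 13 units of total absolute error.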
Classification Losses: 0/1, Hinge, Logistic (BCE)
All three are plotted against the margin z = y · ŷ, where y ∈ {−1, +1} is the label and ŷ is the raw model score. When z > 0 the prediction is on the correct side; z > 1 means "confident and correct".
0/1 Loss — step function (non-differentiable)
Hinge Loss (SVM)
L = max(0, 1 − y · ŷ)
∂L/∂ŷ = −y if y·ŷ < 1, and 0 if y·ŷ ≥ 1
Gradient = 0 when z ≥ 1 (correct and outside the margin). The loss is piecewise linear, so only points with z < 1 trigger updates; these sparse updates are what make it efficient for SVMs.
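A minimal sketch of the hinge loss and its (sub)gradient with respect to the score, assuming labels y ∈ {−1, +1}. The function names are illustrative. At y·ŷ = 1 any value between −y and 0 is a valid subgradient; the code picks 0, matching the formula above.

```python
def hinge(y, score):
    # max(0, 1 - y * score)
    return max(0.0, 1.0 - y * score)


def hinge_subgrad(y, score):
    # Derivative with respect to the score; zero once the margin reaches 1.
    return -float(y) if y * score < 1 else 0.0


for y, score in [(1, 2.0), (1, 0.5), (-1, 0.5)]:
    print(y, score, hinge(y, score), hinge_subgrad(y, score))
```

The second case (y = +1, score = 0.5) is correctly classified but still inside the margin, so it has a loss of 0.5 and a nonzero gradient.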
Logistic Loss / BCE
L = log(1 + exp(−z)) (margin form, z = y·ŷ, y ∈ {−1,+1})
BCE form: −[y′ log p + (1−y′)log(1−p)]
(y′ ∈ {0,1}, p = σ(ŷ) — different label encoding)
∂L/∂ŷ = −y(1 − σ(y·ŷ)) (margin form)
Always smooth (C∞), gradient never exactly 0. "Soft margin" — even confidently correct points receive a small push.
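The two forms of the logistic loss and the gradient above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's code: the function names are made up, the natural logarithm is used, and the margin form is written as max(0, −z) + log1p(exp(−|z|)) to avoid overflow for large negative margins.

```python
import math


def sigmoid(t):
    # Split by sign so exp never receives a large positive argument.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def logistic_loss(y, score):
    # Margin form log(1 + exp(-z)) with z = y * score, y in {-1, +1}.
    z = y * score
    return max(0.0, -z) + math.log1p(math.exp(-abs(z)))


def logistic_grad(y, score):
    # dL/dscore = -y * (1 - sigmoid(y * score))
    return -y * (1.0 - sigmoid(y * score))


def bce(y01, score):
    # BCE form with labels in {0, 1} and p = sigmoid(score).
    p = sigmoid(score)
    return -(y01 * math.log(p) + (1 - y01) * math.log(1.0 - p))


print(logistic_loss(1, 1.5), bce(1, 1.5))    # same value, different label encoding
print(logistic_loss(-1, 1.5), bce(0, 1.5))
print(logistic_grad(1, 10.0))                # tiny, but not zero
```

Mapping y = +1 to y′ = 1 and y = −1 to y′ = 0 makes the two forms agree, which is why the margin form and BCE are the same loss under different label encodings.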
Surrogate losses — the key idea
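Hinge and Logistic are convex surrogates for the 0/1 loss. Hinge upper-bounds it directly; Logistic upper-bounds it when the logarithm is taken base 2 (with the natural log the loss at z = 0 is ln 2 ≈ 0.69, so it must be divided by ln 2 to be a bound). Hinge is piecewise-linear (subdifferentiable, not differentiable at z = 1); Logistic is smooth everywhere (C∞). Because the surrogate sits above the 0/1 loss, driving the surrogate down also caps the misclassification error, so gradient descent on a surrogate loss is a principled proxy for minimizing 0/1 loss.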
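The upper-bound property can be checked on a grid of margins. The sketch below is illustrative only: the function names are hypothetical, a point with z = 0 is counted as an error, and the logistic loss is evaluated both with the natural log and with log base 2, since only the base-2 version stays at or above the 0/1 loss at z = 0.

```python
import math


def zero_one_margin(z):
    # Assumption: a margin of exactly 0 counts as a misclassification.
    return 1.0 if z <= 0 else 0.0


def hinge_margin(z):
    return max(0.0, 1.0 - z)


def logistic_natural(z):
    return math.log(1.0 + math.exp(-z))


def logistic_base2(z):
    return math.log2(1.0 + math.exp(-z))


margins = [i / 10 for i in range(-30, 31)]  # z from -3.0 to 3.0
hinge_bounds = all(hinge_margin(z) >= zero_one_margin(z) for z in margins)
base2_bounds = all(logistic_base2(z) >= zero_one_margin(z) for z in margins)
natural_bounds = all(logistic_natural(z) >= zero_one_margin(z) for z in margins)
print(hinge_bounds, base2_bounds, natural_bounds)
```

Rescaling by the constant 1/ln 2 does not change which parameters minimize the loss, so the natural-log version is still a valid training objective; it is just not a literal upper bound.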
Multi-Class: Categorical Cross-Entropy
Adjust the logits for 3 classes. Softmax converts them to probabilities; CCE penalizes the model for low confidence on the correct class.
Categorical Cross-Entropy
L = −Σᵢ yᵢ · log(pᵢ)
= −log(p_correct) (one-hot)
∂L/∂zᵢ = pᵢ − yᵢ
Gradient = predicted probability minus one-hot target. The result is simple, and computing softmax and the log together (via log-sum-exp) keeps the loss numerically stable.
Softmax
pᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)
Maps any real-valued logits to a valid probability distribution (all positive, sum = 1).
Note:
−log(p) → ∞ as p → 0 (very wrong = very large loss)
−log(p) → 0 as p → 1 (confident and correct = small loss) ✓
Softmax + CCE is the standard output layer for multi-class neural networks.
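A minimal sketch of softmax, categorical cross-entropy and the gradient p − y for a single example with a one-hot target. The function names and the example logits are assumptions made for illustration. Subtracting the largest logit before exponentiating is the usual trick to avoid overflow and does not change the result.

```python
import math


def softmax(logits):
    m = max(logits)  # shift for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_index):
    # -log(p_correct), computed as logsumexp(z) - z_correct.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target_index]


def cross_entropy_grad(logits, target_index):
    # dL/dz_i = p_i - y_i for a one-hot target y.
    p = softmax(logits)
    return [p_i - (1.0 if i == target_index else 0.0) for i, p_i in enumerate(p)]


logits = [2.0, 1.0, 0.1]  # hypothetical logits for 3 classes
probs = softmax(logits)
loss = cross_entropy(logits, 0)
grad = cross_entropy_grad(logits, 0)
print(probs, loss, grad)
```

The gradient entries sum to zero: the correct class is pushed up by exactly as much as the other classes are pushed down in total.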
Expected Loss vs Average Loss
Expected Loss (theoretical)
L = E_(x,y)~P [ℓ(f(x), y)]
- Expectation over the true distribution P
- The ultimate goal — minimize real-world error
- P is unknown → cannot compute directly
- We never have access to this during training
Average Loss (empirical / training)
L̂ = (1/N) Σᵢ ℓ(f(xᵢ), yᵢ)
- Average over the N training examples
- What gradient descent actually minimizes
- A tractable proxy for expected loss
- Can be made arbitrarily small by overfitting
Generalization gap = L − L̂
When training loss L̂ is low but the loss on held-out test data (an estimate of L) is high, the model has overfit to the training sample: it has memorized the training data rather than learning the true pattern. Techniques that help close the gap include L1/L2 regularization, dropout, data augmentation, and early stopping.
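The difference between the two losses can be illustrated with a toy simulation in which the true distribution P is known, which is never the case in practice. Everything here is an assumption made for the sketch: P draws x uniformly from (−1, 1) and sets y = 2x plus Gaussian noise, the model is ŷ = w·x fit by least squares on five points, and the expected loss is approximated by the average loss on a large fresh sample.

```python
import random


def sample(n, rng):
    # Hypothetical true distribution P: x ~ U(-1, 1), y = 2x + N(0, 0.5^2).
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.5)))
    return data


def average_loss(w, data):
    # Empirical squared-error loss of the model y_hat = w * x.
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def fit_w(data):
    # Closed-form least squares for a single weight and no intercept.
    return sum(x * y for x, y in data) / sum(x * x for x, _ in data)


rng = random.Random(0)
train = sample(5, rng)                      # small training set
w_hat = fit_w(train)
train_loss = average_loss(w_hat, train)     # average (empirical) loss
fresh = sample(20000, rng)                  # stand-in for the expectation over P
estimated_expected_loss = average_loss(w_hat, fresh)
gap = estimated_expected_loss - train_loss  # estimated generalization gap
print(train_loss, estimated_expected_loss, gap)
```

By construction the fitted weight has a training loss no higher than that of the true weight w = 2, while its expected loss cannot fall much below the noise variance of 0.25. That asymmetry is the source of the gap.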