📝 Full-Batch Gradient Descent — Step by Step
Click ▶ Next Step to walk through one iteration phase by phase. The pseudo-code line currently executing lights up; the right panel shows the actual numbers being plugged into the exam formula.
Initialize w = w₀ (here: 4.5)for iteration in range(m): dL/dw = (1/n) Σ(ŷᵢ − yᵢ)xᵢ w = w − α · dL/dwreturn w
w ← w − α · dL/dw
⚖ Why Not Size = 1 or Size = n? — Batch Size Trade-offs
Drag the slider to feel the trade-off the exam asked about. The two ends are bad for opposite reasons; the sweet spot in the middle is why production training uses mini-batch.
drop_last=False (default) processes the partial batch normally and runs ⌈n/B⌉ updates per epoch; drop_last=True skips it for an integer ⌊n/B⌋ updates. §3 below uses a slightly different scheme — each step picks a fresh random subset of size B rather than walking a shuffled epoch — so its SGD path is genuinely stochastic across Resets, which is the whole point of demonstrating gradient noise.
B = n (Batch GD): exact, slow, memory-heavy
B ∈ [32, 256] (mini-batch): the production sweet spot
🏔 Convergence Paths on the Loss Surface
All three strategies start at the same w₀ = 4.5 and chase the optimum w* = 2.0. Watch how full-batch GD slides smoothly, mini-batch wobbles a little, and SGD zig-zags loudly on its way down.
🎛 Effect of Learning Rate α
Three full-batch runs with three different α values, animated side by side. The y-axis is loss, the x-axis is iteration.
α < 1 / E[x²] → smooth descent
α > 2 / E[x²] → divergent