Gradient Descent Explorer

Step through full-batch gradient descent line by line, see why mini-batch sits in the sweet spot between SGD and full batch, watch all three converge on the loss landscape, and feel the effect of the learning rate α.

Linear regression on n = 20 synthetic points y ≈ 2x + noise. The exam formulas dL/dw = (1/n)Σ(ŷᵢ − yᵢ)xᵢ and w ← w − α · dL/dw drive every animation on this page.

📝 Full-Batch Gradient Descent — Step by Step

Click ▶ Next Step to walk through one iteration phase by phase. The pseudo-code line currently executing lights up; the right panel shows the actual numbers being plugged into the exam formula.

Initialize w = w₀  (here: 4.5)
for iteration in range(m):
    dL/dw = (1/n) Σ(ŷᵢ − yᵢ)xᵢ
    w = w − α · dL/dw
return w
Iteration: 0 / 20
Currently executing:
n = 20
α = 0.10
dL/dw
w (before)
w (after)
Loss
dL/dw = (1/n) Σ(ŷᵢ − yᵢ)xᵢ // where ŷᵢ = w·xᵢ
w ← w − α · dL/dw
💡 Full-batch GD scans all n samples before each weight update, so the gradient is exact (zero variance) — but every update costs O(n) work.
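The pseudo-code above can be turned into a short runnable sketch. The data generator (x drawn uniformly from [0, 3], Gaussian noise with standard deviation 0.3, fixed seed) and the loss L = (1/2n) Σ(ŷᵢ − yᵢ)² are assumptions chosen so that the gradient matches the exam formula; the page itself does not state them.

```python
import random

def make_data(n=20, slope=2.0, noise=0.3, seed=0):
    """Synthetic points y ≈ slope*x + noise (x-range and noise level are assumed)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 3.0) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def gradient(w, xs, ys):
    """dL/dw = (1/n) * sum((y_hat_i - y_i) * x_i), with y_hat_i = w * x_i."""
    return sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def loss(w, xs, ys):
    """Assumed loss L = (1/2n) * sum((y_hat_i - y_i)^2); its derivative is gradient()."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * len(xs))

def full_batch_gd(xs, ys, w0=4.5, alpha=0.10, iterations=20):
    """Every update scans all n samples, so each step costs O(n)."""
    w = w0
    history = [(w, loss(w, xs, ys))]          # entry k = state after k updates
    for _ in range(iterations):
        w = w - alpha * gradient(w, xs, ys)   # w <- w - alpha * dL/dw
        history.append((w, loss(w, xs, ys)))
    return w, history

xs, ys = make_data()
w_final, history = full_batch_gd(xs, ys)
print(f"w after {len(history) - 1} iterations: {w_final:.3f}")
```

Each entry of `history` holds the weight and loss after one more update, which is the kind of before/after pair the step panel displays.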

⚖ Why Not Size = 1 or Size = n? — Batch Size Trade-offs

Drag the slider to feel the trade-off the exam asked about. The two ends are bad for opposite reasons; the sweet spot in the middle is why production training uses mini-batch.

GPU utilization (approx.): 63 %
Gradient noise (1 / √B, normalised): 35 %
Updates per epoch: 3
Updates per epoch — visual layout of n = 20 samples
Each [ ● ● ● ] processes together; the weights update once per group.
When n isn't divisible by B (e.g. B = 8, n = 20 → groups of 8, 8, 4), the last batch is smaller. PyTorch's drop_last=False (default) processes the partial batch normally and runs ⌈n/B⌉ updates per epoch; drop_last=True skips it, giving ⌊n/B⌋ updates. The Convergence Paths section below uses a slightly different scheme: each step picks a fresh random subset of size B rather than walking a shuffled epoch. Its SGD path is therefore genuinely stochastic across Resets, which is the whole point of demonstrating gradient noise.
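The two batching schemes described above can be sketched in plain Python. The helper names `epoch_batches` and `random_subset` are hypothetical; the first only mimics the drop_last behaviour described here and is not PyTorch's implementation.

```python
import math
import random

def epoch_batches(n, batch_size, drop_last=False, seed=None):
    """Shuffle indices 0..n-1 once, then cut them into consecutive batches."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    batches = [indices[i:i + batch_size] for i in range(0, n, batch_size)]
    # Optionally discard a trailing batch that is smaller than batch_size.
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches

def random_subset(n, batch_size, rng):
    """Alternative scheme: draw a fresh subset of size B for every step."""
    return rng.sample(range(n), batch_size)

sizes_keep = [len(b) for b in epoch_batches(20, 8, drop_last=False, seed=0)]
sizes_drop = [len(b) for b in epoch_batches(20, 8, drop_last=True, seed=0)]
print(sizes_keep, "updates:", len(sizes_keep))   # ceil(20 / 8) batches
print(sizes_drop, "updates:", len(sizes_drop))   # floor(20 / 8) batches
```

With the shuffled-epoch scheme every sample is visited exactly once per epoch (unless the last batch is dropped); with `random_subset` a sample may be picked in several consecutive steps or not at all.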
B = 1 (SGD): high noise, no SIMD
B = n (Batch GD): exact, slow, memory-heavy
B ∈ [32, 256] (mini-batch): the production sweet spot
💡 The numbers above are illustrative — real GPU utilisation depends on architecture, kernel, and dtype. The qualitative trade-off (noise ↓ as B ↑, updates / epoch ↓ as B ↑) is universal.
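The 1 / √B noise figure can be checked empirically with a small experiment. This is a sketch under stated assumptions: the data generator is made up, the gradient is measured at the starting weight w = 4.5, and batches are drawn with replacement so that the 1 / √B law holds exactly in expectation. When batches are drawn without replacement the noise falls faster and reaches zero at B = n.

```python
import math
import random
import statistics

rng = random.Random(0)
n = 20
xs = [rng.uniform(0.0, 3.0) for _ in range(n)]        # assumed x-range
ys = [2.0 * x + rng.gauss(0.0, 0.3) for x in xs]      # assumed noise level
w = 4.5                                               # measure noise at the start

def batch_gradient(w, idx):
    """Mini-batch estimate of dL/dw over the sample indices in idx."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)

def gradient_std(batch_size, trials=4000):
    """Spread of the gradient estimate over many randomly drawn batches."""
    grads = [
        batch_gradient(w, rng.choices(range(n), k=batch_size))  # with replacement
        for _ in range(trials)
    ]
    return statistics.pstdev(grads)

std_1 = gradient_std(1)
for B in (1, 4, 8, 16):
    measured = gradient_std(B) / std_1
    print(f"B={B:2d}  measured noise={measured:.2f}  "
          f"1/sqrt(B)={1 / math.sqrt(B):.2f}  updates/epoch={math.ceil(n / B)}")
```

The measured column is a Monte Carlo estimate, so it only approximates the 1 / √B column; the updates-per-epoch column is exact.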

🏔 Convergence Paths on the Loss Surface

All three strategies start at the same w₀ = 4.5 and chase the optimum w* = 2.0. Watch how full-batch GD slides smoothly, mini-batch wobbles a little, and SGD zig-zags loudly on its way down.

● Batch GD (B = n)
w = 4.500
Loss =
● Mini-batch (B = 8)
w = 4.500
Loss =
● SGD (B = 1)
w = 4.500
Loss =
w_{t+1} = w_t − α · ĝ_t,  ĝ_t = (1/B) Σ_{i ∈ batch} (w·xᵢ − yᵢ)·xᵢ
💡 Every strategy has the same long-run target. SGD's noisier path can help escape sharp minima in deep networks, but here on a smooth quadratic the wobble is pure cost.
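A minimal sketch of the three runs, using the update rule above with a fresh random subset of size B at every step. The data generator, α = 0.10, the step count, and the function name `run` are assumptions for illustration, not the page's actual settings.

```python
import random

data_rng = random.Random(1)
n = 20
xs = [data_rng.uniform(0.0, 3.0) for _ in range(n)]      # assumed x-range
ys = [2.0 * x + data_rng.gauss(0.0, 0.3) for x in xs]    # assumed noise level

def run(batch_size, w0=4.5, alpha=0.10, steps=60, seed=0):
    """w_{t+1} = w_t - alpha * g_hat_t, with g_hat_t averaged over a random batch."""
    step_rng = random.Random(seed)
    w, path = w0, [w0]
    for _ in range(steps):
        batch = step_rng.sample(range(n), batch_size)    # fresh subset each step
        g_hat = sum((w * xs[i] - ys[i]) * xs[i] for i in batch) / batch_size
        w -= alpha * g_hat
        path.append(w)
    return path

paths = {
    "Batch GD (B = n)": run(n),      # subset = all samples, so the gradient is exact
    "Mini-batch (B = 8)": run(8),
    "SGD (B = 1)": run(1),
}
for name, path in paths.items():
    print(f"{name:20s} final w = {path[-1]:.3f}")
```

Changing `seed` alters the mini-batch and SGD paths but leaves the full-batch path unchanged, because a subset of size n always contains every sample.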

🎛 Effect of Learning Rate α

Three full-batch runs with three different α values, animated side by side. The y-axis is loss, the x-axis is iteration.

α = 0.01 — too small
Steps are tiny; loss still drifting down after 50 iterations.
α = 0.10 — just right
Loss drops fast, then settles cleanly at the optimum.
α = 0.45 — too large
Each step overshoots the minimum by more than the distance it started from, so the error grows and the run diverges.
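For a 1-D quadratic loss with curvature E[x²], each update multiplies the error w − w* by (1 − α · E[x²]):
α < 1 / E[x²] → smooth descent
1 / E[x²] < α < 2 / E[x²] → overshoots and oscillates, but still converges
α > 2 / E[x²] → divergent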
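These thresholds can be checked numerically. The sketch below relies on the fact that, for this model, one full-batch update multiplies the error w − w* by ρ = 1 − α · E[x²]. The evenly spaced x grid and the noise-free targets are assumptions made so the run is reproducible; they give E[x²] = 5.74, for which α = 0.45 lies above 2 / E[x²].

```python
n = 20
xs = [0.2 * i for i in range(1, n + 1)]   # assumed grid: 0.2, 0.4, ..., 4.0
ys = [2.0 * x for x in xs]                # noise omitted, so w* = 2 exactly

curvature = sum(x * x for x in xs) / n    # E[x^2] over the data

def regime(alpha, c):
    """Classify a positive learning rate by the per-step factor rho = 1 - alpha*c."""
    rho = 1.0 - alpha * c
    if rho >= 0.0:
        return "smooth descent"             # alpha <= 1/c: no overshoot
    if rho > -1.0:
        return "oscillating, convergent"    # 1/c < alpha < 2/c
    return "divergent"                      # alpha >= 2/c: error never shrinks

def run(alpha, w0=4.5, iterations=50):
    """Plain full-batch gradient descent with a fixed learning rate."""
    w = w0
    for _ in range(iterations):
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= alpha * grad
    return w

for alpha in (0.01, 0.10, 0.25, 0.45):
    print(f"alpha={alpha:.2f}  {regime(alpha, curvature):24s} "
          f"|w - w*| after 50 iterations = {abs(run(alpha) - 2.0):.3g}")
```

The extra value α = 0.25 is included only to exercise the middle regime; it is not one of the three panels.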
💡 In practice: start at α = 0.1, halve it whenever loss bumps up. Adam and friends adapt α per parameter, sparing you most of the manual tuning.
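The halving rule of thumb can be sketched as follows. Rejecting the step that made the loss go up and retrying from the same w is an added assumption (the tip only says to halve α), and the data grid, the loss L = (1/2n) Σ(ŷᵢ − yᵢ)², and the function name `gd_with_halving` are hypothetical.

```python
def gd_with_halving(xs, ys, w0=4.5, alpha=0.1, iterations=50):
    """Full-batch gradient descent that halves alpha whenever the loss rises."""
    n = len(xs)

    def loss(w):
        # Assumed loss: L = (1/2n) * sum((w*x_i - y_i)^2)
        return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

    w, current = w0, loss(w0)
    for _ in range(iterations):
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        candidate = w - alpha * grad
        candidate_loss = loss(candidate)
        if candidate_loss > current:
            alpha /= 2.0          # loss bumped up: halve alpha ...
            continue              # ... and retry from the same w (assumption)
        w, current = candidate, candidate_loss
    return w, alpha

xs = [0.2 * i for i in range(1, 21)]   # assumed grid: 0.2, 0.4, ..., 4.0
ys = [2.0 * x for x in xs]             # noise omitted, so w* = 2 exactly

for start in (0.10, 0.45):
    w, final_alpha = gd_with_halving(xs, ys, alpha=start, iterations=20)
    print(f"start alpha={start:.2f}  final alpha={final_alpha:.3f}  w={w:.6f}")
```

On this data a start of 0.45 would diverge with a fixed rate, so the first step is rejected and the rate is halved once; a start of 0.10 is never halved.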