Gradient Descent Explorer

📝 Full-Batch Gradient Descent — Step by Step

Click ▶ Next Step to walk through one iteration phase by phase. The pseudo-code line currently executing lights up; the right panel shows the actual numbers being plugged into the exam formula.

Initialize w = w₀  (here: 4.5)
for iteration in range(m):
    dL/dw = (1/n) Σ(ŷᵢ − yᵢ)xᵢ
    w = w − α · dL/dw
return w

Settings: n = 20, α = 0.10, m = 20.

Per-step readout: dL/dw, w (before), w (after), and loss (before → after).

dL/dw = (1/n) Σ(ŷᵢ − yᵢ)xᵢ // where ŷᵢ = w·xᵢ
w ← w − α · dL/dw

💡 Full-batch GD scans all n samples before each weight update, so the gradient is exact (zero variance) — but every update costs O(n) work.
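As a rough illustration of the loop above, here is a minimal Python sketch. The data (20 evenly spaced x values with noise-free labels y = 2x) and the name full_batch_gd are assumptions made for this example, not the explorer's actual dataset or code. The gradient formula corresponds to a loss of (1/2n) Σ(ŷᵢ − yᵢ)², which is also an assumption about how the page defines its loss.

```python
def full_batch_gd(xs, ys, w0=4.5, alpha=0.10, m=20):
    """Run m full-batch gradient descent steps on the model y_hat = w * x."""
    n = len(xs)
    w = w0
    history = [w]  # w before any update, then w after each update
    for _ in range(m):
        # dL/dw = (1/n) * sum((y_hat_i - y_i) * x_i), with y_hat_i = w * x_i
        grad = sum((w * x - y) * x for x, y in zip(xs, ys)) / n
        w = w - alpha * grad
        history.append(w)
    return w, history


# Hypothetical noise-free data with true slope 2.0 (not the page's dataset).
xs = [0.5 + 0.1 * i for i in range(20)]
ys = [2.0 * x for x in xs]

w_final, history = full_batch_gd(xs, ys)
print(f"w after {len(history) - 1} iterations: {w_final:.4f}")
```

Every iteration touches all n samples once, which is where the O(n) cost per update comes from.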

⚖ Why Not Size = 1 or Size = n? — Batch Size Trade-offs

Drag the slider to feel the trade-off the exam asked about. The two ends are bad for opposite reasons; the sweet spot in the middle is why production training uses mini-batch.

Batch size B = 8 (n = 20)
GPU utilization (approx.): 63 %
Gradient noise (1 / √B, normalised): 35 %
Updates per epoch: 3

Updates per epoch — visual layout of n = 20 samples

Each [ ● ● ● ] group is processed together; the weights update once per group.

When n isn't divisible by B (e.g. B = 8, n = 20 → groups of 8, 8, 4), the last batch is smaller. PyTorch's DataLoader with drop_last=False (the default) processes the partial batch normally and runs ⌈n/B⌉ updates per epoch; drop_last=True discards it, leaving ⌊n/B⌋ updates. §3 below uses a slightly different scheme: each step picks a fresh random subset of size B rather than walking a shuffled epoch, so its SGD path differs from one Reset to the next, which is the whole point of demonstrating gradient noise.
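The batch counts can be checked with a few lines of plain Python. The helper batch_sizes below is a hypothetical name that only mimics the drop_last behaviour described above; it does not call PyTorch.

```python
import math


def batch_sizes(n, B, drop_last=False):
    """Sizes of the consecutive batches in one epoch of n samples."""
    full, remainder = divmod(n, B)
    sizes = [B] * full
    if remainder and not drop_last:
        sizes.append(remainder)  # partial final batch is kept
    return sizes


print(batch_sizes(20, 8))                  # [8, 8, 4]
print(batch_sizes(20, 8, drop_last=True))  # [8, 8]
print(math.ceil(20 / 8), 20 // 8)          # 3 2
```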

B = 1 (SGD): high noise, no SIMD
B = n (Batch GD): exact, slow, memory-heavy
B ∈ [32, 256] (mini-batch): the production sweet spot

💡 The numbers above are illustrative — real GPU utilisation depends on architecture, kernel, and dtype. The qualitative trade-off (noise ↓ as B ↑, updates / epoch ↓ as B ↑) is universal.
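To see the noise trend empirically, one could estimate the spread of the mini-batch gradient at w₀ = 4.5 for a few batch sizes. The sketch below assumes hypothetical noise-free data and draws each batch without replacement, so the measured spread only roughly follows 1/√B and falls to zero at B = n. The function names, the trial count, and the seed are illustrative choices.

```python
import random
import statistics


def minibatch_grad(w, xs, ys, idx):
    """Average gradient of the half squared error over the samples in idx."""
    return sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / len(idx)


def grad_noise(w, xs, ys, B, trials=2000, seed=0):
    """Standard deviation of the mini-batch gradient over random batches of size B."""
    rng = random.Random(seed)
    n = len(xs)
    grads = [
        minibatch_grad(w, xs, ys, rng.sample(range(n), B))  # subset without replacement
        for _ in range(trials)
    ]
    return statistics.pstdev(grads)


# Hypothetical data (not the page's dataset): y = 2x, no label noise.
xs = [0.5 + 0.1 * i for i in range(20)]
ys = [2.0 * x for x in xs]

noise = {B: grad_noise(4.5, xs, ys, B) for B in (1, 8, 20)}
for B, s in noise.items():
    print(f"B = {B:2d}: gradient std = {s:.4f}")
```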

🏔 Convergence Paths on the Loss Surface

All three strategies start at the same w₀ = 4.5 and chase the optimum w* = 2.0. Watch how full-batch GD slides smoothly, mini-batch wobbles a little, and SGD zig-zags loudly on its way down.

Settings: α = 0.10, mini-batch B = 8.

● Batch GD (B = n), ● Mini-batch (B = 8), ● SGD (B = 1): each starts at w = 4.500 with no loss recorded yet.

w_{t+1} = w_t − α · ĝ_t, where ĝ_t = (1/B) Σ_{i ∈ batch} (w_t·xᵢ − yᵢ)·xᵢ

💡 Every strategy has the same long-run target. In deep networks, SGD's noisier path is often credited with helping escape sharp minima; here on a smooth quadratic the wobble is pure cost.
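The three paths can be imitated with one function that differs only in B. This is a sketch under stated assumptions: the data (y = 2x plus bounded uniform label noise), the seeds, and the name run are hypothetical, and each step draws a fresh random subset of size B as described for this section.

```python
import random


def run(xs, ys, B, w0=4.5, alpha=0.10, steps=20, seed=0):
    """Return the path of w for gradient descent with random batches of size B."""
    rng = random.Random(seed)
    n = len(xs)
    w, path = w0, [w0]
    for _ in range(steps):
        idx = rng.sample(range(n), B)  # fresh random subset each step
        g = sum((w * xs[i] - ys[i]) * xs[i] for i in idx) / B
        w -= alpha * g
        path.append(w)
    return path


# Hypothetical noisy data around the true slope 2.0 (not the page's dataset).
data_rng = random.Random(42)
xs = [0.5 + 0.1 * i for i in range(20)]
ys = [2.0 * x + data_rng.uniform(-0.5, 0.5) for x in xs]

paths = {name: run(xs, ys, B) for name, B in [("batch", 20), ("mini", 8), ("sgd", 1)]}

# Least-squares optimum for this sample; full-batch GD should approach it.
w_ls = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
for name, path in paths.items():
    print(f"{name:5s} final w = {path[-1]:.3f} (least-squares w = {w_ls:.3f})")
```

With B = n the random subset is always the full dataset, so that run is deterministic; changing seed alters only the mini-batch and SGD paths.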

🎛 Effect of Learning Rate α

Three full-batch runs with three different α values, animated side by side. The y-axis is loss, the x-axis is iteration.

α = 0.01 — too small

Steps are tiny; loss still drifting down after 50 iterations.

α = 0.10 — just right

Loss drops fast, then settles cleanly at the optimum.

α = 0.45 — too large

Step overshoots the minimum and the error grows — diverging.

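For a 1-D quadratic loss with curvature E[x²], each step multiplies the error w − w* by (1 − α · E[x²]):
α < 1 / E[x²] → smooth descent
1 / E[x²] < α < 2 / E[x²] → overshoots and oscillates, but still converges
α > 2 / E[x²] → divergent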
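These thresholds follow from the fact that, on a 1-D quadratic, each full-batch step multiplies the error w − w* by (1 − α · E[x²]). The sketch below classifies a few step sizes under an assumed curvature E[x²] = 5.0, a value chosen only so that the three α values on this page land in the regimes described; the page's real curvature is not given. Between 1/E[x²] and 2/E[x²] the error flips sign every step but still shrinks.

```python
def regime(alpha, c):
    """Classify full-batch GD on a 1-D quadratic whose curvature is c = E[x^2]."""
    ac = alpha * c
    if ac > 2.0:
        return "divergent"
    if ac > 1.0:
        # Converges while ac < 2; at exactly ac = 2 the error flips sign forever.
        return "oscillating"
    return "smooth descent"


def error_after(alpha, c, steps=20, e0=2.5):
    """|w - w*| after `steps` updates, starting from an initial error e0."""
    return abs(e0 * (1.0 - alpha * c) ** steps)


C = 5.0  # assumed curvature, for illustration only
for alpha in (0.01, 0.10, 0.30, 0.45):
    print(f"alpha = {alpha:.2f}: {regime(alpha, C):14s} "
          f"error after 20 steps = {error_after(alpha, C):.3g}")
```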

💡 A common rule of thumb: start at α = 0.1 and halve it whenever the loss bumps up. Adaptive optimizers such as Adam scale the step size per parameter, sparing you most of the manual tuning.