Gradient Descent Variations

This visualization builds on the course idea that ML training minimizes a loss function iteratively using gradient descent.

Loss surface: f(x, y) = x² + 10y²

Class Connection

In lecture, gradient descent is introduced as an iterative method for minimizing loss. This page shows how changing the update rule affects the path, speed, and stability of convergence.

Hyperparameters

α (learning rate): 0.070, adjustable from 0.005 to 0.200
Higher α: faster convergence, but may overshoot or diverge
Lower α: more stable, but needs more iterations
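
The trade-off above can be checked directly on this page's surface. The sketch below is a minimal illustration, not the page's own code: the start point (-4.0, 1.5), the step count, and the function names are assumptions chosen for the example. Because f is quadratic, each step multiplies x by (1 − 2α) and y by (1 − 20α), so the y-coordinate diverges once α exceeds 0.1.

```python
def grad(x, y):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    return 2.0 * x, 20.0 * y


def run_gd(alpha, start=(-4.0, 1.5), steps=200):
    """Run plain gradient descent and return the final point.

    The start point and step count are illustrative assumptions.
    """
    x, y = start
    for _ in range(steps):
        gx, gy = grad(x, y)
        x -= alpha * gx
        y -= alpha * gy
    return x, y


# Each step scales x by (1 - 2*alpha) and y by (1 - 20*alpha),
# so y blows up when |1 - 20*alpha| > 1, i.e. when alpha > 0.1.
for alpha in (0.005, 0.07, 0.15):
    x, y = run_gd(alpha)
    print(f"alpha={alpha}: x={x:.3e}, y={y:.3e}")
```

With these assumed settings, α = 0.005 is still far from the minimum along x after 200 steps, α = 0.07 reaches it, and α = 0.15 diverges along y.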


SGD and its variations

SGD is the core lecture concept; every other update rule below modifies it by adding a correction term. On this fixed surface the gradient is exact, so SGD here behaves as plain gradient descent.

SGD
base
θ ← θ − α · ∇f

The lecture's core update rule. Zigzags when one direction is much steeper than the other, as with the 10y² term here.
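
A minimal sketch of this update on f(x, y) = x² + 10y², assuming a start point of (-4.0, 1.5) and α = 0.07 (both illustrative choices, as are the function names). It records the first few points so the zigzag is visible: along the steep y-axis each step overshoots the minimum and flips the sign of y, while x creeps steadily toward zero.

```python
def grad(theta):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    x, y = theta
    return (2.0 * x, 20.0 * y)


def sgd_step(theta, alpha):
    """One update: theta <- theta - alpha * grad f(theta)."""
    g = grad(theta)
    return tuple(t - alpha * gi for t, gi in zip(theta, g))


theta = (-4.0, 1.5)  # assumed start point
path = [theta]
for _ in range(5):
    theta = sgd_step(theta, alpha=0.07)
    path.append(theta)

for x, y in path:
    print(f"x={x:+.3f}  y={y:+.3f}")
```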

Momentum
+ velocity
v ← βv − α∇f
θ ← θ + v

SGD with memory of past steps. Damps oscillations over time.
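
The two-line rule above could be sketched as follows. The momentum coefficient β = 0.9 is an assumption (the page does not state its value), as are the start point, step count, and function names.

```python
def grad(theta):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    x, y = theta
    return (2.0 * x, 20.0 * y)


def momentum_step(theta, v, alpha=0.07, beta=0.9):
    """v <- beta*v - alpha*grad f;  theta <- theta + v.

    beta=0.9 is an assumed value, not taken from the page.
    """
    g = grad(theta)
    v = tuple(beta * vi - alpha * gi for vi, gi in zip(v, g))
    theta = tuple(t + vi for t, vi in zip(theta, v))
    return theta, v


theta, v = (-4.0, 1.5), (0.0, 0.0)  # assumed start, zero initial velocity
for _ in range(300):
    theta, v = momentum_step(theta, v)
print(theta)
```

The velocity v is the only extra state: it starts at zero and accumulates a decaying sum of past gradient steps.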

RMSProp
adaptive α
s ← ρs + (1−ρ)g²
θ ← θ − α·g/√s

SGD with a per-axis learning rate, where g = ∇f and s is a running average of g². Straightens the path.
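
A minimal sketch of the per-axis scaling, under stated assumptions: ρ = 0.9, a small ε = 1e-8 added to the denominator to avoid division by zero (the formula above omits it), s initialized to zero, and an illustrative start point. On the first step, dividing by √s cancels the gradient's magnitude, so both axes move by the same distance even though the y-gradient is far larger.

```python
import math


def grad(theta):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    x, y = theta
    return (2.0 * x, 20.0 * y)


def rmsprop_step(theta, s, alpha=0.07, rho=0.9, eps=1e-8):
    """s <- rho*s + (1-rho)*g**2;  theta <- theta - alpha*g/sqrt(s).

    rho and eps are assumed values; eps only guards against s == 0.
    """
    g = grad(theta)
    s = tuple(rho * si + (1.0 - rho) * gi * gi for si, gi in zip(s, g))
    theta = tuple(
        t - alpha * gi / (math.sqrt(si) + eps)
        for t, gi, si in zip(theta, g, s)
    )
    return theta, s


start = (-4.0, 1.5)  # assumed start point
theta, s = rmsprop_step(start, (0.0, 0.0))
dx, dy = theta[0] - start[0], theta[1] - start[1]
print(f"first step: dx={dx:+.4f}, dy={dy:+.4f}")
```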

Adam
both
m ← β₁m + (1−β₁)g
v ← β₂v + (1−β₂)g²
m̂ = m/(1−β₁ᵗ),  v̂ = v/(1−β₂ᵗ)
θ ← θ − α·m̂/√v̂

Momentum + RMSProp combined, with bias correction for the early steps. A common default in modern training.
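
A sketch of one Adam update on the same surface. The values β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the commonly used defaults and are assumed here; the page does not state which values it uses. The start point and function names are likewise illustrative. The moment estimates m and v start at zero, and t counts steps from 1.

```python
import math


def grad(theta):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    x, y = theta
    return (2.0 * x, 20.0 * y)


def adam_step(theta, m, v, t, alpha=0.07, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count.

    beta1, beta2, and eps are assumed defaults, not taken from the page.
    """
    g = grad(theta)
    # Running averages of the gradient and the squared gradient.
    m = tuple(beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, g))
    v = tuple(beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, g))
    # Bias correction compensates for the zero initialization.
    m_hat = tuple(mi / (1.0 - beta1 ** t) for mi in m)
    v_hat = tuple(vi / (1.0 - beta2 ** t) for vi in v)
    theta = tuple(
        th - alpha * mh / (math.sqrt(vh) + eps)
        for th, mh, vh in zip(theta, m_hat, v_hat)
    )
    return theta, m, v


theta, m, v = (-4.0, 1.5), (0.0, 0.0), (0.0, 0.0)  # assumed start point
theta, m, v = adam_step(theta, m, v, t=1)
first = theta
print(first)
```

On the first step the bias-corrected ratio m̂/√v̂ reduces to the sign of the gradient, so each coordinate moves by almost exactly α toward the minimum.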