Gradient Descent Variations

This visualization builds on the course idea that ML training minimizes a loss function iteratively using gradient descent.

Loss surface: f(x, y) = x² + 10y²

Class Connection

In lecture, gradient descent is introduced as an iterative method for minimizing loss. This page shows how changing the update rule affects the path, speed, and stability of convergence.

Hyperparameters

α (learning rate): 0.070, adjustable from 0.005 to 0.200

Higher α: faster convergence, but may overshoot or diverge

Lower α: more stable, but needs more iterations
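
The trade-off can be checked numerically. On this surface each plain gradient step multiplies x by (1 − 2α) and y by (1 − 20α), so the y coordinate only shrinks when α < 0.1. The sketch below is a minimal illustration, not the page's own code: the start point (−4, 1.5), the tolerance, and the helper name steps_to_converge are all assumptions chosen for the example.

```python
def grad(x, y):
    """Gradient of f(x, y) = x**2 + 10*y**2."""
    return 2.0 * x, 20.0 * y


def steps_to_converge(alpha, start=(-4.0, 1.5), tol=1e-6, max_steps=10_000):
    """Run plain gradient descent; return the step count, or None on failure.

    start, tol, and max_steps are illustrative assumptions.
    """
    x, y = start
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        x, y = x - alpha * gx, y - alpha * gy
        if abs(x) > 1e12 or abs(y) > 1e12:  # blew up: treat as divergence
            return None
        if x * x + 10.0 * y * y < tol:  # loss is close enough to the minimum
            return step
    return None  # never reached the tolerance


for alpha in (0.005, 0.07, 0.15):
    print(alpha, steps_to_converge(alpha))
```

With these assumed settings, α = 0.15 is reported as None because 1 − 20α = −2, so y doubles in magnitude on every step.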

Variations

Click the canvas to set a new start point.

Crank speed to 100× to see the full race.

Loss surface: f(x, y) = x² + 10y², with its minimum at (0, 0)

SGD and its variations

SGD is the core lecture concept. Every other update rule below modifies the same idea by adding a correction term.

SGD

base

θ ← θ − α · ∇f

The lecture's core update rule. Zigzags across the steep y direction, where the curvature is ten times that along x.
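
A minimal sketch of this rule on f(x, y) = x² + 10y², assuming α = 0.07 and a hypothetical start point of (−4, 1.5). The function name gd_path is made up for the example.

```python
def gd_path(alpha=0.07, start=(-4.0, 1.5), steps=6):
    """Return the points visited by theta <- theta - alpha * grad f.

    The start point and step count are illustrative assumptions.
    """
    x, y = start
    path = [(x, y)]
    for _ in range(steps):
        x -= alpha * 2.0 * x   # df/dx = 2x
        y -= alpha * 20.0 * y  # df/dy = 20y
        path.append((x, y))
    return path


for x, y in gd_path():
    print(f"x = {x:+.3f}, y = {y:+.3f}")
```

At α = 0.07 the y coordinate is multiplied by 1 − 20α = −0.4 each step, so its sign flips every iteration while x creeps steadily toward zero. That alternation is the zigzag.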

Momentum

+ velocity

v ← βv − α∇f
θ ← θ + v

SGD with memory of past steps. Damps oscillations over time.
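
The two-line update above can be sketched as follows. The values β = 0.9, α = 0.07, the start point, and the name momentum_path are assumptions for illustration; the page does not state which β it uses.

```python
def momentum_path(alpha=0.07, beta=0.9, start=(-4.0, 1.5), steps=200):
    """Return the points visited by the velocity-based update.

    v     <- beta * v - alpha * grad f
    theta <- theta + v
    beta, start, and steps are illustrative assumptions.
    """
    x, y = start
    vx, vy = 0.0, 0.0  # velocity starts at rest
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = 2.0 * x, 20.0 * y  # gradient of x**2 + 10*y**2
        vx = beta * vx - alpha * gx
        vy = beta * vy - alpha * gy
        x, y = x + vx, y + vy
        path.append((x, y))
    return path


x, y = momentum_path()[-1]
print(f"final loss: {x * x + 10.0 * y * y:.3e}")
```

Because the velocity starts at zero, the first step is identical to a plain gradient step; the memory term only takes effect from the second step onward.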

RMSProp

adaptive α

g = ∇f
s ← ρs + (1−ρ)g²
θ ← θ − α·g/√s

SGD with a per-axis learning rate. Straightens the path.
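
A hedged sketch of the per-axis scaling, assuming ρ = 0.9, α = 0.07, s initialized to zero, and a small ε added to the denominator to avoid division by zero. The ε term, the start point, and the name rmsprop_path are assumptions not shown on the card above.

```python
import math


def rmsprop_path(alpha=0.07, rho=0.9, eps=1e-8, start=(-4.0, 1.5), steps=50):
    """Return the points visited by an RMSProp-style update.

    s     <- rho * s + (1 - rho) * g**2   (tracked separately per axis)
    theta <- theta - alpha * g / (sqrt(s) + eps)
    rho, eps, start, and steps are illustrative assumptions.
    """
    x, y = start
    sx, sy = 0.0, 0.0  # running averages of squared gradients
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = 2.0 * x, 20.0 * y  # gradient of x**2 + 10*y**2
        sx = rho * sx + (1.0 - rho) * gx * gx
        sy = rho * sy + (1.0 - rho) * gy * gy
        x -= alpha * gx / (math.sqrt(sx) + eps)
        y -= alpha * gy / (math.sqrt(sy) + eps)
        path.append((x, y))
    return path


(x0, y0), (x1, y1) = rmsprop_path()[:2]
print(f"first step: dx = {x1 - x0:+.4f}, dy = {y1 - y0:+.4f}")
```

With s starting at zero, the first step has length α/√(1−ρ) along both axes even though the y gradient is several times larger than the x gradient. Dividing by √s cancels the gradient's magnitude, which is why the path is less dominated by the steep direction.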

Adam

both

g = ∇f
m ← β₁m + (1−β₁)g
v ← β₂v + (1−β₂)g²
m̂ = m/(1−β₁ᵗ), v̂ = v/(1−β₂ᵗ)
θ ← θ − α·m̂/√v̂

Momentum + RMSProp combined. The default in modern training.
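
A minimal sketch of the full Adam update on the same surface. The defaults β₁ = 0.9, β₂ = 0.999, and ε = 1e-8 are the commonly cited ones and are assumptions here, as are the start point and the name adam_path; the ε term is added only to guard against division by zero.

```python
import math


def adam_path(alpha=0.07, beta1=0.9, beta2=0.999, eps=1e-8,
              start=(-4.0, 1.5), steps=50):
    """Return the points visited by an Adam-style update.

    All hyperparameter values and the start point are illustrative assumptions.
    """
    theta = list(start)
    m = [0.0, 0.0]  # first-moment (momentum-like) estimate per axis
    v = [0.0, 0.0]  # second-moment (RMSProp-like) estimate per axis
    path = [tuple(theta)]
    for t in range(1, steps + 1):
        g = (2.0 * theta[0], 20.0 * theta[1])  # gradient of x**2 + 10*y**2
        for i in range(2):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)  # bias correction for m
            v_hat = v[i] / (1.0 - beta2 ** t)  # bias correction for v
            theta[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
        path.append(tuple(theta))
    return path


(x0, y0), (x1, y1) = adam_path()[:2]
print(f"first step: dx = {x1 - x0:+.4f}, dy = {y1 - y0:+.4f}")
```

On the first step the bias corrections give m̂ = g and v̂ = g², so each coordinate moves by almost exactly α toward the minimum, independent of how steep that axis is.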