🚨 Vanishing & Exploding Gradients
Deep neural networks are effectively long chains of matrix multiplications. If we initialize the weight matrices poorly, the activations will either shrink toward zero or grow without bound as they pass through the layers, and the gradients flowing backward suffer the same fate.
📉 Vanishing Activations (Too Small)
If the initialized weights are too small (e.g., variance < 1/N), every layer shrinks the activation variance by a constant factor, so the values decay exponentially with depth. By Layer 5, the activations have collapsed to nearly 0.
L2_Var = 0.25
...
L5_Var = 0.00001 (Gradient signal dies)
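The "gradient signal dies" effect can be sketched numerically. The snippet below is a minimal illustration, not the page's own simulation: it assumes a purely linear 5-layer network of width 256 and a weight standard deviation set to half the variance-preserving value (the helper name `backward_grad_norms` and all sizes are hypothetical). It backpropagates a unit-norm gradient from the output and records its norm after each layer.

```python
import numpy as np

def backward_grad_norms(weight_std, n=256, layers=5, seed=0):
    """Backpropagate a unit-norm gradient through a linear network.

    Returns the gradient norm after each layer, ordered from the
    output layer back toward the input.
    """
    rng = np.random.default_rng(seed)
    weights = [rng.standard_normal((n, n)) * weight_std for _ in range(layers)]

    grad = rng.standard_normal(n)
    grad /= np.linalg.norm(grad)  # upstream gradient with norm 1

    norms = []
    for w in reversed(weights):
        # For a linear layer y = x @ w, the chain rule gives dL/dx = dL/dy @ w.T
        grad = grad @ w.T
        norms.append(float(np.linalg.norm(grad)))
    return norms

# Assumption: std is 0.5 / sqrt(N), i.e. weight variance is 0.25 / N (< 1/N)
small_norms = backward_grad_norms(weight_std=0.5 / np.sqrt(256))
for depth, norm in enumerate(small_norms, start=1):
    print(f"{depth} layer(s) back from the output: gradient norm = {norm:.4f}")
```

With this choice the norm should roughly halve at every layer, so the earliest layers receive only a small fraction of the original signal. The exact values depend on the random seed and will not match the illustrative figures above.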
🌋 Exploding Activations (Too Big)
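If the weights are too large (e.g., variance > 1/N), every layer amplifies the activation variance by a constant factor, so the values grow exponentially with depth. By Layer 5, the activations are already orders of magnitude larger than the inputs, and in deeper networks or low-precision formats they eventually overflow.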
L2_Var = 4.0
...
L5_Var = 1024.0 (Heading toward overflow and NaN)
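How quickly growth turns into overflow depends on the number format. The sketch below assumes IEEE half precision (float16, largest finite value 65504), a linear network of width 256, and weights whose standard deviation is 16 times the variance-preserving value; the format, the sizes, and the helper name `first_overflow_layers` are all illustrative assumptions rather than anything specified in the text.

```python
import numpy as np

def first_overflow_layers(gain=16.0, n=256, samples=64, layers=8, seed=0):
    """Run a float16 linear forward pass with oversized weights.

    Returns (first layer containing inf, first layer containing NaN);
    either entry is None if that event never occurs.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((samples, n)).astype(np.float16)
    first_inf, first_nan = None, None

    with np.errstate(all="ignore"):  # silence overflow/invalid warnings
        for layer in range(1, layers + 1):
            # Weight std is `gain` times the variance-preserving value 1/sqrt(n)
            w = (rng.standard_normal((n, n)) * gain / np.sqrt(n)).astype(np.float16)
            x = x @ w  # linear layer, result stays in float16
            if first_inf is None and np.isinf(x).any():
                first_inf = layer
            if first_nan is None and np.isnan(x).any():
                first_nan = layer
    return first_inf, first_nan

first_inf, first_nan = first_overflow_layers()
print(f"first inf at layer {first_inf}, first NaN at layer {first_nan}")
```

Once both +inf and -inf are present in a layer's input, the next weighted sum adds them together and yields NaN, which then spreads to every later layer.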
🎯 The Goal of Initialization
We want the variance of the activations to remain constant across all layers.
If variance is preserved during the forward pass, gradients tend to be preserved during the backward pass as well, because backpropagation multiplies by the same weight matrices (transposed). This lets the network train efficiently without the signal dying or overflowing.
📐 The Mathematics of Stability
How do we guarantee variance stays constant? By scaling the random weights based on the number of input connections (fan-in, denoted as N).
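The scaling rule follows from a short variance calculation. The sketch below assumes that a layer's weights and inputs are independent and identically distributed and that the weights have zero mean; the symbols y, w_i, x_i, and z are introduced here for illustration only. For the ReLU case, z is assumed to be symmetric around zero.

```latex
\begin{aligned}
y &= \sum_{i=1}^{N} w_i x_i \\
\operatorname{Var}(y) &= N \,\operatorname{Var}(w)\, \mathbb{E}[x^2] \\
\mathbb{E}[x^2] = \operatorname{Var}(x)
  &\;\Rightarrow\; \operatorname{Var}(y) = \operatorname{Var}(x)
  \iff \operatorname{Var}(w) = \frac{1}{N}
  \quad \text{(zero-mean inputs)} \\
x = \max(0, z),\;\; \mathbb{E}[x^2] = \tfrac{1}{2}\operatorname{Var}(z)
  &\;\Rightarrow\; \operatorname{Var}(y) = \operatorname{Var}(z)
  \iff \operatorname{Var}(w) = \frac{2}{N}
  \quad \text{(ReLU inputs)}
\end{aligned}
```

The extra factor of 2 in the ReLU case compensates for the half of the pre-activations that ReLU sets to zero.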
Xavier (Glorot) initialization: Var(W) = 1/N. Best for Tanh / Sigmoid / Linear
He (Kaiming) initialization: Var(W) = 2/N. Best for ReLU / Leaky ReLU
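The fan-in scaling for the two activation families can be sketched in a few lines of NumPy. This assumes Gaussian weights, the fan-in form of the Xavier-style rule (variance 1/N), and the He-style rule (variance 2/N). The function names `xavier_init` and `he_init` are hypothetical, and the original Glorot formulation averages fan-in and fan-out, so library implementations may differ from this sketch.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Fan-in variant: zero-mean Gaussian weights with variance 1 / fan_in."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out, rng):
    """Zero-mean Gaussian weights with variance 2 / fan_in (ReLU-family layers)."""
    return rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)

rng = np.random.default_rng(0)
w_tanh = xavier_init(512, 512, rng)  # shape (fan_in, fan_out), used as x @ w
w_relu = he_init(512, 512, rng)

# Empirical variance times N should land close to 1 and 2 respectively
print(f"Xavier: Var(W) * N = {w_tanh.var() * 512:.3f}")
print(f"He:     Var(W) * N = {w_relu.var() * 512:.3f}")
```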
📊 Live Forward Pass Simulation
Configure the network and click Run Forward Pass to simulate 5,000 data points passing through 5 layers. Watch how the distribution of activations behaves.
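An offline approximation of this simulation can be written with NumPy. The sketch below is not the page's actual implementation: the layer width of 128, the standard-normal inputs, the two "bad" weight scales, and the names `simulate`, `INIT_STD`, and `ACTIVATIONS` are assumptions chosen for illustration. It pushes 5,000 samples through 5 layers and reports the standard deviation of the activations after each layer.

```python
import numpy as np

# Weight standard deviation as a function of fan-in n (hypothetical presets)
INIT_STD = {
    "xavier": lambda n: np.sqrt(1.0 / n),
    "he": lambda n: np.sqrt(2.0 / n),
    "too_small": lambda n: 0.01,
    "too_large": lambda n: 1.0,
}
ACTIVATIONS = {
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
}

def simulate(init, activation, width=128, layers=5, samples=5000, seed=0):
    """Return the activation standard deviation after each layer."""
    rng = np.random.default_rng(seed)
    act = ACTIVATIONS[activation]
    a = rng.standard_normal((samples, width))  # synthetic input batch
    stds = []
    for _ in range(layers):
        w = rng.standard_normal((width, width)) * INIT_STD[init](width)
        a = act(a @ w)  # one fully connected layer followed by the activation
        stds.append(float(a.std()))
    return stds

configs = [("xavier", "tanh"), ("he", "relu"),
           ("too_small", "relu"), ("too_large", "relu")]
for init, activation in configs:
    stds = simulate(init, activation)
    row = "  ".join(f"{s:.3g}" for s in stds)
    print(f"{init:>9} + {activation:<4}: {row}")
```

The matched pairs (Xavier with tanh, He with ReLU) should keep the spread of activations within the same order of magnitude across all five layers, while the two mis-scaled runs shrink or grow by orders of magnitude.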
⚡ Hardware Systems & Quantization Impact
Weight initialization is not just an algorithmic training trick. For EE508 students, it is the first line of defense for hardware efficiency and lower-precision quantization.
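When activations explode past the largest value the number format can represent, they overflow to infinity and then to NaN, permanently halting training.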
🏟 Hardware Dynamic Range
Proper initialization (Xavier/He) keeps the activation variance roughly constant from layer to layer, so values stay well inside the representable range of the hardware format. This makes it much easier to choose scale factors for INT8 quantization that cause little clipping or underflow.
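The link between activation scale and INT8 quality can be illustrated with a toy quantizer. The sketch below assumes symmetric per-tensor quantization to the range [-127, 127] and a single fixed scale calibrated to cover about ±4 standard deviations of unit-variance activations; the scheme, the calibration choice, and the names `quantize_int8` and `quantization_stats` are hypothetical and not tied to any particular hardware.

```python
import numpy as np

def quantize_int8(x, scale):
    """Symmetric per-tensor quantization: round x / scale, clip to [-127, 127]."""
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def quantization_stats(activation_std, scale, samples=10_000, seed=0):
    """Fraction of values that saturate at +/-127 or collapse to 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(samples) * activation_std
    q = quantize_int8(x, scale)
    return {
        "saturated": float(np.mean(np.abs(q) == 127)),
        "zeroed": float(np.mean(q == 0)),
    }

# Assumption: one scale, calibrated for unit-variance activations (+/- 4 std)
scale = 4.0 / 127.0

results = {
    "stable": quantization_stats(1.0, scale),     # variance preserved
    "vanished": quantization_stats(1e-3, scale),  # activations shrank
    "exploded": quantization_stats(32.0, scale),  # activations grew
}
for name, stats in results.items():
    print(f"{name:>8}: saturated={stats['saturated']:.3f}  zeroed={stats['zeroed']:.3f}")
```

In this toy setup, vanished activations all round to zero (underflow) and most exploded activations saturate at ±127 (clipping), whereas the stable case uses the INT8 range with almost no loss at either end.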