Weight Initialization - EE508 Interactive Visualizer

🚨 Vanishing & Exploding Gradients

Deep neural networks are effectively long chains of matrix multiplications. If we initialize the weight matrices poorly, the activations will either shrink toward zero or grow without bound as they pass through the layers.

📉 Vanishing Activations (Too Small)

If the initialized weights are too small (e.g., variance < 1/N), multiplying inputs by weights repeatedly shrinks the values. In this example the variance drops 4x per layer, so by Layer 5 the activations are already close to 0.

          L1_Var = 1.0

          L2_Var = 0.25

          ...

          L5_Var ≈ 0.0039 (Signal keeps shrinking 4x per layer)

🌋 Exploding Activations (Too Big)

If weights are too large (e.g., variance > 1/N), the values grow exponentially with depth. In this example the variance grows 4x per layer; in a deep enough network the activations exceed the representable range and overflow.

          L1_Var = 1.0

          L2_Var = 4.0

          ...

          L5_Var = 256.0 (Still growing 4x per layer)
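The two scenarios above follow the same recurrence: each layer multiplies the activation variance by a fixed factor. A minimal sketch, assuming that factor is constant across layers (the function name `variance_by_layer` and the parameter `gain` are illustrative, not part of the visualizer):

```python
def variance_by_layer(gain, num_layers=5, input_var=1.0):
    """Activation variance at each layer when every layer multiplies the
    variance by the same factor `gain` (assumed equal to N * Var(W))."""
    variances = [input_var]
    for _ in range(num_layers - 1):
        variances.append(variances[-1] * gain)
    return variances


shrinking = variance_by_layer(0.25)  # weights too small
growing = variance_by_layer(4.0)     # weights too large

for name, values in (("vanishing", shrinking), ("exploding", growing)):
    print(name, values)
```

With a gain of exactly 1.0 the list stays constant, which is the condition good initialization aims for.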

🎯 The Goal of Initialization

We want the variance of the activations to remain constant across all layers.

Var(X) ≈ Var(H₁) ≈ ... ≈ Var(H_L)

If variance is preserved during the forward pass, gradients are approximately preserved during the backward pass as well (the match is closest when layers have equal fan-in and fan-out), allowing the network to train efficiently without signals dying or overflowing.

📐 The Mathematics of Stability

How do we guarantee variance stays constant? By scaling the random weights based on the number of input connections (fan-in, denoted as N).

Xavier (Glorot) Initialization

Best for Tanh / Sigmoid / Linear

Var(W) = 1 / N

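Draw weights from a zero-mean distribution with variance exactly 1 / N. Each output sums N input-weight products, so its variance is N × Var(W) × Var(input). Choosing Var(W) = 1 / N makes that factor 1, so the output variance matches the input variance (1.0 for unit-variance inputs). This is the fan-in form; the original Glorot formula, 2 / (N_in + N_out), reduces to 1 / N when fan-in and fan-out are equal.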
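The 1 / N rule can be checked with a short derivation for a single linear unit. This is a sketch under stated assumptions (i.i.d. zero-mean weights independent of the inputs, zero-mean inputs with a common variance); the symbols y, w_i and x_i are introduced here for illustration.

```latex
% Assumptions: weights w_i are i.i.d., zero-mean, and independent of the inputs;
% inputs x_i are zero-mean with a common variance Var(X).
\begin{align*}
y &= \sum_{i=1}^{N} w_i x_i \\
\operatorname{Var}(y) &= \sum_{i=1}^{N} \operatorname{Var}(w_i x_i)
  = N \,\operatorname{Var}(W)\,\operatorname{Var}(X) \\
\operatorname{Var}(y) = \operatorname{Var}(X)
  &\iff \operatorname{Var}(W) = \frac{1}{N}
\end{align*}
```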

He Initialization

Best for ReLU / Leaky ReLU

Var(W) = 2 / N

For zero-mean, symmetric pre-activations, ReLU sets about half of the values to 0, which halves the mean square of the signal carried forward. To compensate, we multiply the weight variance by 2 to restore balance.
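The factor of 2 can be made explicit by tracking the mean square of the activations through one ReLU layer. A sketch, assuming the weights are i.i.d., zero-mean and symmetric about 0, so that each pre-activation is symmetric about 0 as well; the layer index \ell and the symbols z and h are introduced here for illustration.

```latex
% Assumptions: z_l = \sum_i w_i h_{l-1,i} with i.i.d. weights that are zero-mean
% and symmetric about 0, so z_l is symmetric about 0; h_l = max(z_l, 0).
\begin{align*}
\operatorname{Var}(z_\ell) &= N \,\operatorname{Var}(W)\, \mathbb{E}\!\left[h_{\ell-1}^2\right] \\
\mathbb{E}\!\left[h_\ell^2\right] &= \mathbb{E}\!\left[\max(z_\ell, 0)^2\right]
  = \tfrac{1}{2}\,\mathbb{E}\!\left[z_\ell^2\right]
  = \tfrac{1}{2}\,\operatorname{Var}(z_\ell) \\
\mathbb{E}\!\left[h_\ell^2\right] &= \tfrac{1}{2}\, N \,\operatorname{Var}(W)\,
  \mathbb{E}\!\left[h_{\ell-1}^2\right]
\end{align*}
% Var(W) = 2/N gives a per-layer factor of 1; Var(W) = 1/N gives a factor of 1/2.
```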

🔑 Key Insight: If you use Xavier initialization with ReLU, the signal's mean square roughly halves at every layer (about 3% of its starting value after 5 layers), eventually vanishing. You must pair the initialization method with the correct activation function!

📊 Live Forward Pass Simulation

Configure the network and click Run Forward Pass to simulate 5,000 data points passing through 5 layers. Watch how the distribution of activations behaves.

[Interactive panel: selectors for Weight Initialization and Activation Function, followed by one activation histogram per layer (Layer 1 through Layer 5, the output), each labeled with its Std and plotted on an x-axis from -3 to 3.]

💡 Simulation Ready: Choose your settings above and hit run. Try Xavier + Linear vs Xavier + ReLU to see why He initialization was invented!
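For readers without access to the interactive panel, the experiment can be approximated offline. The sketch below is not the visualizer's code: the function `forward_rms` and its parameters are hypothetical, it uses only the Python standard library, and it runs 200 points through 64-unit layers (rather than 5,000 points) to keep pure-Python runtime short. It reports the root-mean-square activation per layer, which equals the Std whenever the activations are zero-mean.

```python
import math
import random


def forward_rms(init, activation, width=64, samples=200, layers=5, seed=0):
    """Return the root-mean-square activation after each layer."""
    rng = random.Random(seed)
    # Fan-in scaling as described above: 1/N for Xavier, 2/N for He.
    weight_var = {"xavier": 1.0 / width, "he": 2.0 / width}[init]
    weight_std = math.sqrt(weight_var)
    act = {
        "linear": lambda z: z,
        "relu": lambda z: max(z, 0.0),
        "tanh": math.tanh,
    }[activation]

    # Unit-variance Gaussian inputs, one row per data point.
    batch = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(samples)]
    rms_per_layer = []
    for _ in range(layers):
        # Fresh random weight matrix for this layer (one row per output unit).
        weights = [[rng.gauss(0.0, weight_std) for _ in range(width)]
                   for _ in range(width)]
        batch = [[act(sum(w * x for w, x in zip(row, h))) for row in weights]
                 for h in batch]
        mean_square = sum(v * v for h in batch for v in h) / (samples * width)
        rms_per_layer.append(math.sqrt(mean_square))
    return rms_per_layer


results = {
    "xavier + linear": forward_rms("xavier", "linear"),
    "xavier + relu": forward_rms("xavier", "relu"),
    "he + relu": forward_rms("he", "relu"),
}
for name, rms in results.items():
    print(name, [round(r, 3) for r in rms])
```

Because all three runs share a seed, the He weights are the Xavier weights scaled by √2, so the He + ReLU values are the Xavier + ReLU values multiplied by √2 once per layer. That makes the compensation for ReLU's halving directly visible in the printed numbers.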

⚡ Hardware Systems & Quantization Impact

Weight initialization is not just an algorithmic training trick. For EE508 students, it is the first line of defense for hardware efficiency and low-precision quantization.

💥 FP16 Overflow (Exploding)

Modern GPUs accelerate training using FP16 (half precision) or FP8 Tensor Cores. FP16 has a maximum representable value of 65,504. If poor initialization causes activations to explode past that limit, they overflow to infinity, and subsequent operations such as inf - inf or 0 × inf turn them into NaN. Once NaNs reach the loss or the weights, training cannot recover without intervention.
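The overflow limit can be reproduced without a GPU. The sketch below uses the standard library's `struct` module, whose `"e"` format is IEEE 754 half precision. The helper `to_fp16` is hypothetical; `struct` raises `OverflowError` for out-of-range values, and the helper maps that to infinity to mimic what floating-point hardware produces. The doubling per layer is an assumed growth rate (variance growing 4x per layer).

```python
import math
import struct

FP16_MAX = 65504.0  # largest finite half-precision value


def to_fp16(x):
    """Round a Python float to the nearest half-precision value.

    struct raises OverflowError beyond the FP16 range; hardware would
    produce +/-inf instead, so that is what this helper returns.
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


# Assumed scenario: the activation magnitude doubles at every layer.
value = to_fp16(1.0)
layers_until_overflow = 0
while math.isfinite(value):
    layers_until_overflow += 1
    value = to_fp16(value * 2.0)

poisoned = value - value  # inf - inf is NaN
print(layers_until_overflow, value, poisoned)
```

Under this growth rate the value stays finite through 2^15 = 32,768 and overflows at the 16th doubling, so a 5-layer network with this gain is uncomfortable but not yet at the FP16 limit; deeper stacks are where the overflow appears.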

⚛️ INT8 Underflow (Vanishing)

When deploying to edge devices, we quantize weights and activations to INT8 (range [-128, 127]). If activations vanish to near-zero (e.g., std = 0.001) while the quantization scale is set for a normal range such as [-3, 3], every value rounds to the integer 0. The signal is destroyed, collapsing the entire network output to a dead zero state.
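The rounding effect is easy to reproduce. This is a minimal sketch of symmetric INT8 quantization, assuming a hypothetical shared scale of 3 / 127 (so that [-3, 3] maps onto the integer range); `quantize_int8` is an illustrative helper, not a library API.

```python
import random


def quantize_int8(values, scale):
    """Symmetric quantization: round(x / scale), clipped to [-128, 127]."""
    return [max(-128, min(127, round(v / scale))) for v in values]


rng = random.Random(0)
shared_scale = 3.0 / 127  # hypothetical scale covering roughly [-3, 3]

healthy = [rng.gauss(0.0, 1.0) for _ in range(1000)]     # std ~ 1.0
vanished = [rng.gauss(0.0, 0.001) for _ in range(1000)]  # std ~ 0.001

healthy_q = quantize_int8(healthy, shared_scale)
vanished_q = quantize_int8(vanished, shared_scale)

# A scale fitted to the tiny tensor itself restores resolution, but it then
# has to be tracked separately from the scale of well-behaved tensors.
own_scale = max(abs(v) for v in vanished) / 127
vanished_own_q = quantize_int8(vanished, own_scale)

print(len(set(healthy_q)), len(set(vanished_q)), len(set(vanished_own_q)))
```

The three printed counts are the number of distinct integer levels actually used in each case. With the shared scale the vanished tensor occupies a single level (zero), which is the collapse described above.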

🔋 Wasted Compute Epochs

If gradients are unstable, the network can need far more epochs to converge, or a much smaller learning rate to train at all. In a cloud setting, every wasted epoch is paid for in GPU time, power, and memory bandwidth. For a large model, 50 epochs lost to bad initialization can amount to thousands of dollars.

🏟 Hardware Dynamic Range

Proper initialization (Xavier/He) keeps activations at a scale of roughly Std ≈ 1.0, well inside the representable range of the hardware format. This makes it possible to choose INT8 scale factors that avoid both clipping and underflow.

Regime            Activation scale
Underflow         Std < 0.001
Ideal Range       Std ≈ 1.0, values mostly in [-3.0, 3.0]
Overflow (NaN)    Magnitude > 65,504 (FP16 Max)

🔑 Key Insight: Before you can even attempt advanced EE508 techniques like Post-Training Quantization (PTQ) or Knowledge Distillation, the base model must be numerically stable. Stable variance = Hardware friendly ranges.