Understanding Internal Covariate Shift

Why deep neural networks struggle to learn and how Batch Normalization acts as a "stabilizer."

WHAT exactly do we normalize?

A common misconception is that we normalize all neurons in a layer together. We don't.

Instead, for each individual neuron, we look at its activations across the entire mini-batch.

The Neuron Calculation:

In a layer with 100 neurons, we perform 100 separate normalizations. Neuron #1 is normalized using the mean of Neuron #1's values across all 64 images (examples) in your batch. Neuron #2 is normalized using Neuron #2's values, and so on.

Key Rule:

"One neuron is normalized against its own behavior across the batch, not against its neighbors in the layer."
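This rule can be sketched in a few lines of NumPy (shapes and values here are illustrative, matching the 64-example, 100-neuron setup above): the statistics are computed down the batch axis, giving one mean and one variance per neuron.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
acts = rng.normal(loc=3.0, scale=2.0, size=(64, 100))  # (batch, neurons)

# One mean and one variance per neuron, computed across the 64 examples:
mu = acts.mean(axis=0)        # shape (100,)
var = acts.var(axis=0)        # shape (100,)

normed = (acts - mu) / np.sqrt(var + 1e-5)

# Every neuron now has mean ~0 and variance ~1 across the batch;
# neighboring neurons never enter each other's statistics.
print(normed.mean(axis=0).round(6)[:3])  # ≈ [0. 0. 0.]
```

Note the `axis=0`: averaging down the batch dimension, not across the layer, is exactly what "normalized against its own behavior" means.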

Figure: the activation matrix (feature map). N1, N2, …: specific neurons (features) in the layer. Ex1, Ex2, …: different examples (images/data points) in the current mini-batch. Each neuron's entries across the batch are normalized together; separate neurons are independent.

The Moving Dartboard Analogy

Think of each layer as a player throwing darts at a board. Without Batch Norm ("Vanilla Mode"), the board keeps moving: every update to the earlier layers changes the distribution of inputs this layer receives, so it must constantly re-aim. With Batch Norm, the board is pinned in place, because the inputs are re-centered to a fixed mean and variance before the layer ever sees them.


Demonstration

How activations drift across Epochs 1-5 when nothing pins their distribution in place.
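The drift is easy to reproduce. The sketch below (illustrative, not the article's data) holds one mini-batch fixed and applies a large upstream weight rescaling, standing in for what several epochs of training do to the earlier layers: the raw statistics a downstream layer sees change completely, while the per-neuron normalized activations do not move at all.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
x = rng.normal(size=(64, 10))   # one fixed mini-batch of inputs
w = rng.normal(size=(10, 5))    # weights feeding a 5-neuron layer

def normalize(h, eps=1e-5):
    # Per-neuron statistics across the batch (axis 0).
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps)

h_before = x @ w                # the layer's inputs at "epoch 1"
w = 3.0 * w                     # upstream layers rescale ("epoch 5")
h_after = x @ w                 # same data, 9x the variance

# Raw statistics drift, so a downstream layer faces a new distribution...
print(h_before.var(axis=0).round(2))
print(h_after.var(axis=0).round(2))
# ...but the normalized activations are identical in both cases:
print(np.allclose(normalize(h_before), normalize(h_after), atol=1e-4))
```

This scale-invariance is the point: downstream layers trained on normalized inputs never have to chase the upstream drift.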

Concept

Batch Norm fixes the mean and variance so later layers don't have to keep re-learning their inputs.

The BN Formula

$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma\,\hat{x} + \beta$

Here $\mu_B$ and $\sigma_B^2$ are the neuron's mean and variance over the mini-batch, $\epsilon$ is a small constant that prevents division by zero, and $\gamma$ and $\beta$ are learned per-neuron parameters that let the network recover any scale and shift it actually needs.
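A minimal sketch of the training-time transform, with the learned scale $\gamma$ and shift $\beta$ included (names and the `eps` default follow common convention; this is illustrative, not any framework's exact API):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch norm over a (batch, features) array."""
    mu = x.mean(axis=0)                     # per-neuron batch mean
    var = x.var(axis=0)                     # per-neuron batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize: mean 0, var 1
    return gamma * x_hat + beta             # learned rescale and shift

rng = np.random.default_rng(seed=2)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 100))
gamma, beta = np.ones(100), np.zeros(100)   # identity initialization

y = batch_norm(x, gamma, beta)              # mean ~0, variance ~1 per neuron
```

At inference time, frameworks swap the batch statistics for running averages accumulated during training, so a single example can be processed deterministically.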

Convergence Speed

In the original paper (Ioffe & Szegedy, 2015), BN matched the baseline network's accuracy in up to 14x fewer training steps, largely because stable activation statistics permit much higher learning rates.

With Batch Norm

The gradient path is smoother, allowing for a higher learning rate. The model converges in fewer steps.

Without Batch Norm

Constant shifts force a lower learning rate to prevent divergence, leading to a slow, shaky descent.

Observation

"BN acts as a smoother for the loss landscape."