Why deep neural networks struggle to learn and how Batch Normalization acts as a "stabilizer."
A common misconception is that we normalize all neurons in a layer together. We don't.
Instead, for each individual neuron, we look at its activations across the entire mini-batch.
In a layer with 100 neurons, we perform 100 separate normalizations. Neuron #1 is normalized using the mean and variance of Neuron #1's values across all examples in your batch (say, 64 images). Neuron #2 uses Neuron #2's values, and so on.
"One neuron is normalized against its own behavior across the batch, not against its neighbors in the layer."
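The per-neuron idea above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up activations (a batch of 64 examples, a layer of 100 neurons); the key point is that the statistics are computed along the batch axis, one column per neuron:

```python
import numpy as np

# Hypothetical activations: 64 examples (the mini-batch) x 100 neurons.
rng = np.random.default_rng(0)
activations = rng.normal(loc=3.0, scale=2.0, size=(64, 100))

# Each neuron (column) is normalized with ITS OWN statistics,
# computed across the 64 examples (axis=0) -- not across its neighbors.
mean = activations.mean(axis=0)   # shape (100,): one mean per neuron
var = activations.var(axis=0)     # shape (100,): one variance per neuron
normalized = (activations - mean) / np.sqrt(var + 1e-5)

# 100 separate normalizations: every column now has ~0 mean, ~1 variance.
print(normalized.mean(axis=0).round(6)[:3])  # each ≈ 0
```

Note that `axis=0` is the whole trick: swap it for `axis=1` and you would be normalizing each image against its neighbors in the layer, which is the misconception this post is correcting.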
Think of each layer as a person throwing a dart. In Vanilla Mode, the board keeps moving because earlier players changed the rules mid-game. In Batch Norm Mode, we pin the board in place, re-centered around the batch's "average" position.
This is how real activations drift across Epochs 1-5 when nothing stabilizes them.
Batch Norm fixes the mean and variance so later layers don't have to keep re-learning their inputs.
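A full Batch Norm layer does one thing beyond the normalization itself: it adds learnable scale (gamma) and shift (beta) parameters, so the network fixes the statistics the next layer sees without losing representational power. A minimal training-mode sketch (function and parameter names are my own; a real layer would also track running statistics for inference):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Norm forward pass, training mode (minimal sketch).

    x: (batch, features) pre-activations.
    gamma, beta: learnable per-feature scale and shift, so the
    network can still pick any mean/scale it finds useful.
    """
    mean = x.mean(axis=0)                  # per-feature batch mean
    var = x.var(axis=0)                    # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(5.0, 3.0, size=(64, 100))   # drifted pre-activations
out = batch_norm(x, gamma=np.ones(100), beta=np.zeros(100))
# With gamma=1, beta=0 the next layer sees ~zero-mean, unit-variance
# inputs on every step, regardless of how x drifted upstream.
```

At inference time, real implementations swap the batch statistics for running averages accumulated during training, so single examples can be normalized without a batch.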
By stabilizing gradients, BN can dramatically accelerate training: the original Batch Normalization paper reported matching its baseline's accuracy in 14x fewer training steps.
The gradient path is smoother, allowing for a higher learning rate. The model converges in fewer steps.
Without BN, constant shifts in a layer's inputs force a lower learning rate to prevent divergence, leading to a slow, shaky descent.
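Here is a toy illustration of why drifting input scales force a small learning rate. For a linear unit under squared loss, the gradient with respect to a weight is proportional to its input, so if input scales vary wildly across features or steps, so do gradient magnitudes. The drifting scales below are fabricated for the demo:

```python
import numpy as np

# For y = w * x with squared loss, dL/dw = 2 * (w*x - target) * x:
# the gradient scales with x itself. Drifting input scales mean
# drifting gradient magnitudes, which caps the safe learning rate.
rng = np.random.default_rng(2)
raw = rng.normal(0, 1, size=(64, 10)) * np.linspace(1, 50, 10)  # scales 1x..50x
norm = (raw - raw.mean(axis=0)) / np.sqrt(raw.var(axis=0) + 1e-5)

# Proxy for per-feature gradient magnitude: mean absolute input.
print(np.abs(raw).mean(axis=0))   # magnitudes spanning ~1.5 orders
print(np.abs(norm).mean(axis=0))  # roughly uniform across features
```

After normalization, one learning rate works for every feature; before it, the rate must be small enough for the largest-scale feature, starving the rest.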
"BN acts as a smoother for the loss landscape." Later research suggests this smoothing, more than reducing covariate shift itself, is the main reason BN helps.