Add & Norm

The stabilization mechanism of the Transformer.

Why "Add & Norm"?

Deep networks are prone to vanishing gradients. The Transformer mitigates this by wrapping every sub-layer (self-attention and feed-forward) in a residual connection followed by layer normalization.

The "Add" (Residual)

The sub-layer's input $X$ is added directly to its output, giving $X + \text{Sublayer}(X)$. This creates a "highway" along which information, and gradients during training, can flow without being degraded.
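The "Add" step can be sketched in a few lines of NumPy. This is a minimal illustration, not a real Transformer layer: `toy_sublayer` is a hypothetical stand-in for attention or feed-forward, and `add_residual` is a name chosen here for clarity.

```python
import numpy as np

def toy_sublayer(x):
    # Hypothetical stand-in for attention or feed-forward:
    # a fixed linear map with every weight equal to 0.1.
    d = x.shape[-1]
    w = np.full((d, d), 0.1)
    return x @ w

def add_residual(x, sublayer):
    # The "Add" step: the input bypasses the sub-layer
    # and is summed with the sub-layer's output.
    return x + sublayer(x)

x = np.array([1.0, 2.0, 3.0])
out = add_residual(x, toy_sublayer)

# If the sub-layer contributes nothing, the input passes through unchanged.
passthrough = add_residual(x, lambda v: np.zeros_like(v))

print(out)
print(passthrough)
```

The second call shows the "highway" property: when the sub-layer outputs zeros, the result is exactly the input.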

Numerical Normalization

Example Vector (Before):

[10.0, 2.0, -5.0, 20.0, 3.0]

After LayerNorm (Mean 0, Var 1):

[0.47, -0.47, -1.30, 1.65, -0.35]
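The numbers above can be reproduced with a short NumPy sketch. It assumes the plain form of layer normalization (subtract the mean, divide by the standard deviation computed from the population variance) and leaves out the learnable gain and bias that full implementations usually include. The function name `layer_norm` and the small `eps` term are illustrative choices, not taken from the source.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the last axis (the feature dimension).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)  # population variance (divides by N)
    # eps is an assumed small constant that guards against division by zero.
    return (x - mean) / np.sqrt(var + eps)

x = np.array([10.0, 2.0, -5.0, 20.0, 3.0])
y = layer_norm(x)

print(np.round(y, 2))           # the normalized vector, rounded to two decimals
print(y.mean(), y.var())        # approximately 0 and 1
```

For this vector the mean is 6.0 and the variance is 71.6, so each entry is shifted by 6.0 and divided by roughly 8.46, which gives the rounded values listed above.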

