The stabilization mechanism of the Transformer.
Deep networks struggle with vanishing gradients. The Transformer solves this by wrapping every sub-layer in a residual connection followed by normalization.
The original input $X$ is added directly to the sub-layer output. This creates a "highway" for information to flow through without being degraded.
Description goes here.