Linear Transformation in Positional Encoding (PE)

The Transformer encodes position using pairs of sine and cosine functions at different frequencies. Because of this, the encoding at pos + k can be obtained by rotating each pair of the encoding at pos, by an angle that depends only on the offset k and that pair's frequency.
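As a minimal sketch of the encoding described above, the function below builds one 128-dimensional vector from 64 (sine, cosine) pairs. The function name positional_encoding, the base of 10000 and the sine-first ordering within each pair are assumptions taken from the usual sinusoidal formulation, not something this page states.

```python
import numpy as np

def positional_encoding(pos, d_model=128, base=10000.0):
    """Sinusoidal encoding of a single position (assumed standard form).

    Pair i occupies dimensions 2i and 2i+1 and uses frequency
    omega_i = base ** (-2i / d_model).
    """
    i = np.arange(d_model // 2)            # pair index 0..63
    omega = base ** (-2.0 * i / d_model)   # one frequency per pair
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * omega)         # even dimensions: sine
    pe[1::2] = np.cos(pos * omega)         # odd dimensions: cosine
    return pe

pe = positional_encoding(5)
print(pe.shape)  # (128,)
```

Each consecutive pair (pe[2i], pe[2i+1]) is a point on the unit circle, which is what makes the rotation picture possible.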


Higher dimensions have lower frequencies (slower rotation).
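To make the frequency claim concrete, here is a small sketch that computes the per-pair frequencies for a 128-dimensional encoding. The base of 10000 is an assumption from the usual sinusoidal formulation; the variable names are illustrative only.

```python
import numpy as np

d_model = 128
i = np.arange(d_model // 2)               # pair index 0..63
omega = 10000.0 ** (-2.0 * i / d_model)   # assumed frequency schedule
wavelength = 2 * np.pi / omega            # positions per full rotation

# Frequencies fall monotonically as the pair index grows,
# so later pairs need many more positions to complete one turn.
print(omega[0], omega[-1])
print(wavelength[0], wavelength[-1])
```

The first pair completes a full turn roughly every 6.28 positions, while the last pair barely moves between neighboring positions.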

Rotation Matrix Math
M(k) × PE(pos) = PE(pos+k)

[ 1.00  0.00 ]   [ 1.00 ]   [ 1.00 ]
[ 0.00  1.00 ] × [ 0.00 ] = [ 0.00 ]
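The identity M(k) × PE(pos) = PE(pos+k) can be checked numerically for a single pair. The sketch below assumes the pair is stored as (sin, cos), in which case the angle-addition formulas give M(k) = [[cos ωk, sin ωk], [-sin ωk, cos ωk]]; with a (cos, sin) ordering the signs of the off-diagonal entries swap. The helper names, the frequency and the values of pos and k are arbitrary choices for illustration.

```python
import numpy as np

def pe_pair(pos, omega):
    """One (sin, cos) pair of the encoding at frequency omega."""
    return np.array([np.sin(omega * pos), np.cos(omega * pos)])

def rotation_matrix(k, omega):
    """2x2 matrix that shifts a (sin, cos) pair by offset k."""
    c, s = np.cos(omega * k), np.sin(omega * k)
    return np.array([[c, s],
                     [-s, c]])

omega = 1.0      # frequency of the first pair (assumed)
pos, k = 5, 3    # example position and offset

shifted = rotation_matrix(k, omega) @ pe_pair(pos, omega)
assert np.allclose(shifted, pe_pair(pos + k, omega))

# With k = 0 the matrix reduces to the identity.
assert np.allclose(rotation_matrix(0, omega), np.eye(2))
```

Note that rotation_matrix takes only k and omega as arguments: the matrix never needs to know pos.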

Full 128-D Vector (Heatmap)

[Heatmap: rows PE(pos) and PE(pos+k); columns Dim 0 through Dim 127; color scale from -1 to +1]

The full vector consists of 64 pairs of coordinates bundled together. The highlighted box shows the single pair (2 dimensions) currently plotted on the right.

2D Projection of Dimensions 0 & 1

[Plot legend: vector at pos; vector at pos + k; rotation angle (ω·k)]

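Notice that changing the Offset (k) changes only the angle of rotation, ω·k; the position pos does not appear in M(k). Because this rotation is a fixed linear map for each k, the model can in principle learn a weight matrix that applies it, which lets it detect relative distances!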
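The same idea extends from one pair to the full 128-dimensional vector: stacking the 64 per-pair rotations along the diagonal gives a single 128 × 128 matrix for each offset k. The sketch below is one way to build and check it, assuming the usual sinusoidal formulation (base 10000, sine-first pairs); shift_matrix and positional_encoding are illustrative names, and a trained model is not guaranteed to learn exactly this matrix.

```python
import numpy as np

def positional_encoding(pos, d_model=128, base=10000.0):
    """Sinusoidal encoding of one position (assumed standard form)."""
    omega = base ** (-2.0 * np.arange(d_model // 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * omega)
    pe[1::2] = np.cos(pos * omega)
    return pe

def shift_matrix(k, d_model=128, base=10000.0):
    """Block-diagonal matrix mapping PE(pos) to PE(pos + k) for any pos."""
    omega = base ** (-2.0 * np.arange(d_model // 2) / d_model)
    M = np.zeros((d_model, d_model))
    for i, w in enumerate(omega):
        c, s = np.cos(w * k), np.sin(w * k)
        # 2x2 rotation block for pair i
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return M

k = 3
M = shift_matrix(k)
# One matrix works for every starting position.
for pos in (0, 5, 40):
    assert np.allclose(M @ positional_encoding(pos),
                       positional_encoding(pos + k))
```

The loop over several positions is the point of the check: the matrix is built from k alone, yet it maps PE(pos) to PE(pos + k) for each pos tried.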