Linear Transformation in Positional Encoding (PE)

The Transformer encodes each position with pairs of sine and cosine functions, one pair per frequency. As a result, the encoding at pos + k can be obtained from the encoding at pos by rotating each pair through an angle that depends only on k.
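A minimal sketch of how such an encoding can be computed, assuming the standard sinusoidal formulation with sine on even dimensions and cosine on odd ones. The function name `positional_encoding`, the width `d_model = 128` (taken from the 128-D heatmap on this page), and the base 10000 are illustrative assumptions, not values confirmed by the demo's source.

```python
import math

def positional_encoding(pos, d_model=128, base=10000.0):
    """Return the sinusoidal encoding of one position as a list of d_model floats."""
    pe = []
    for i in range(d_model // 2):
        omega = 1.0 / base ** (2 * i / d_model)  # frequency of pair i (assumed base)
        pe.append(math.sin(omega * pos))  # even dimension 2i
        pe.append(math.cos(omega * pos))  # odd dimension 2i + 1
    return pe

pe5 = positional_encoding(5)
print(len(pe5))        # number of coordinates: 64 (sin, cos) pairs
print(pe5[0], pe5[1])  # pair i = 0, where omega = 1: sin(5), cos(5)
```

Each consecutive pair shares one frequency, which is what makes a per-pair rotation possible.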

Controls: Base Position (pos) = 5, Offset (k) = 3, Dimension Index (i) = 0

Higher dimension indices have lower frequencies, so their pairs rotate more slowly as the position changes.

Rotation Matrix Math

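With each pair ordered as (sin, cos) and ω the frequency of the selected pair:

$$
M(k)\,PE(pos) = PE(pos+k), \qquad
M(k) = \begin{pmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{pmatrix}, \qquad
PE(pos) = \begin{pmatrix} \sin(\omega\,pos) \\ \cos(\omega\,pos) \end{pmatrix}
$$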
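The identity M(k) · PE(pos) = PE(pos+k) can be checked numerically for a single pair. This is a sketch under stated assumptions: the pair is ordered (sin, cos), M(k) is built from cos(ωk) and sin(ωk) as in the code comment, and ω = 1 for pair i = 0. The values pos = 5 and k = 3 are the control settings shown on this page.

```python
import math

pos, k, omega = 5, 3, 1.0  # control values; omega = 1 is assumed for pair i = 0

# The (sin, cos) pair at pos.
pe_pos = (math.sin(omega * pos), math.cos(omega * pos))

# Assumed rotation matrix M(k) = [[cos wk, sin wk], [-sin wk, cos wk]].
c, s = math.cos(omega * k), math.sin(omega * k)
M = ((c, s), (-s, c))

# Matrix-vector product M(k) * PE(pos).
rotated = (M[0][0] * pe_pos[0] + M[0][1] * pe_pos[1],
           M[1][0] * pe_pos[0] + M[1][1] * pe_pos[1])

# Direct evaluation of the pair at pos + k, for comparison.
pe_shifted = (math.sin(omega * (pos + k)), math.cos(omega * (pos + k)))

print(rotated)
print(pe_shifted)
```

The two printed pairs should agree up to floating-point rounding, which follows from the angle-addition formulas for sine and cosine.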

Full 128-D Vector (Heatmap)

[Heatmap: rows PE(pos) and PE(pos+k); columns Dim 0 to Dim 127; color scale from -1 to +1]

The full vector consists of 64 coordinate pairs concatenated together. The highlighted box marks the single pair (2 dimensions) currently plotted on the right.

2D Projection of Dimension 0 & 1

[Plot: the vector at pos, the vector at pos + k, and the rotation angle (ω·k) between them]

Notice that changing the Offset (k) changes only the angle of rotation; the rotation itself does not depend on pos. In principle, the model can therefore learn a weight matrix that applies this rotation, which gives it a way to detect relative distances.
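To illustrate that one fixed transformation works for every base position, the sketch below rotates all 64 pairs by their own angle ω_i·k and compares the result with the encoding at pos + k for several values of pos. The helper names `encode` and `shift`, the base 10000, and the (sin, cos) ordering are assumptions made for this example.

```python
import math

D_MODEL, BASE = 128, 10000.0  # assumed: 128 dimensions, base 10000

def encode(pos):
    """Sinusoidal encoding with (sin, cos) pairs, one frequency per pair."""
    out = []
    for i in range(D_MODEL // 2):
        w = 1.0 / BASE ** (2 * i / D_MODEL)
        out += [math.sin(w * pos), math.cos(w * pos)]
    return out

def shift(pe, k):
    """Rotate every (sin, cos) pair by its own angle w_i * k; pos is never used."""
    out = []
    for i in range(D_MODEL // 2):
        w = 1.0 / BASE ** (2 * i / D_MODEL)
        c, s = math.cos(w * k), math.sin(w * k)
        a, b = pe[2 * i], pe[2 * i + 1]
        out += [c * a + s * b, -s * a + c * b]
    return out

k = 3
# Largest coordinate-wise difference across several base positions.
max_err = max(
    abs(x - y)
    for pos in (0, 5, 40, 1000)
    for x, y in zip(shift(encode(pos), k), encode(pos + k))
)
print(max_err)  # expected to be at floating-point rounding level
```

Because `shift` takes only the encoding and k as inputs, it is equivalent to multiplying by a fixed block-diagonal matrix, which is the kind of linear map a weight matrix could represent.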