Positional Encoding

pos is the position in the sequence, and i is the dimension index in the encoding space.

Why not PE(pos), instead of PE(pos, i)?

Because positional encoding is a vector, not a scalar.

PE(pos, i) = the i-th component of the vector for position pos

What function should we pick for PE(pos, i)?

  • Monotonic function (e.g. PE(pos) = pos): values grow without bound, giving huge magnitudes, poor numerical stability, and exploding gradients.
  • Single periodic function: positions exactly one period apart would receive identical encodings!
  • Solution: a mix of periodic functions (sin/cos) whose frequencies vary across dimensions, so every position in a practical sequence gets a unique vector.
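A two-line check makes the periodic-function problem concrete (the period of 10 is an arbitrary choice for illustration): a single periodic function hands identical encodings to positions one period apart.

```python
import math

# A single periodic "encoding" with period 10: positions 3 and 13
# are indistinguishable, so the model cannot tell them apart.
period = 10
enc = lambda pos: math.sin(2 * math.pi * pos / period)
print(math.isclose(enc(3), enc(13)))  # True: a collision
```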

"The same word at different positions will have different encodings"


What does this diagram show?

  • Input Embedding (Green): The base meaning of the word. If you click the two "cat"s, their green vectors are identical. To the AI, the dictionary definition is the same.
  • Positional Embedding (Orange): The location of the word. The "cat" at position 2 has a completely different orange vector than the "cat" at position 6.
  • The Takeaway: By adding (+) these vectors, the final embedding for "cat" at position 2 is completely distinct from "cat" at position 6. The model now sees a single vector that simultaneously means "feline" AND "located near the beginning".
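The add step above can be sketched in NumPy. The 6-dimensional "cat" embedding below is made up for illustration; the PE vector follows the standard sinusoidal formula used in the original Transformer (n = 10,000).

```python
import numpy as np

def sinusoidal_pe(pos, d=6, n=10000):
    """Sinusoidal positional encoding vector for a single position."""
    pe = np.zeros(d)
    for i in range(d // 2):
        angle = pos / n ** (2 * i / d)
        pe[2 * i] = np.sin(angle)       # even dims: sine
        pe[2 * i + 1] = np.cos(angle)   # odd dims: cosine
    return pe

# Hypothetical embedding for "cat": identical wherever the word appears.
cat_embedding = np.array([0.2, -0.1, 0.5, 0.3, -0.4, 0.1])

# Same word, different positions -> different final vectors after adding PE.
cat_at_2 = cat_embedding + sinusoidal_pe(2)
cat_at_6 = cat_embedding + sinusoidal_pe(6)
print(np.allclose(cat_at_2, cat_at_6))  # False: the two "cat"s now differ
```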

Positional Encoding As a Matrix (D=4)

The Mathematical Formula

PE(pos, 2i) = sin(pos / n^(2i/D))

PE(pos, 2i+1) = cos(pos / n^(2i/D))

The Effect of n


📉 If n is too small (e.g. 10)

Even the highest (slowest) dimensions complete full cycles too quickly, so the encoding repeats within an ordinary sequence. Distant positions end up with very similar vectors, confusing the model.

📏 If n is too large (e.g. 100,000)

The waves in the higher dimensions stretch out so much that their values barely change across the sequence, so those dimensions carry almost no positional signal.
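Both failure modes can be quantified by the wavelength of the slowest sine dimension, 2π · n^((D−2)/D). A quick sketch (d_model = 64 is an arbitrary choice here): with n = 10 the slowest wave repeats after only ~58 positions, shorter than a typical sequence, while with n = 100,000 it takes hundreds of thousands of positions to budge.

```python
import numpy as np

d_model = 64
for n in (10, 10_000, 100_000):
    # Wavelength of the slowest (highest-index) sine dimension.
    slowest_wavelength = 2 * np.pi * n ** ((d_model - 2) / d_model)
    print(f"n={n:>7}: slowest wave repeats every "
          f"{slowest_wavelength:,.0f} positions")
```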

Each position's 4-dimensional encoding is laid out in sin/cos pairs (here n = 10,000, so the denominator n^(2i/D) is 1 for pair i=0 and 100 for pair i=1):

| Pair  | Dimension   | Formula        |
|-------|-------------|----------------|
| i = 0 | Dim 0 (SIN) | sin(pos / 1)   |
| i = 0 | Dim 1 (COS) | cos(pos / 1)   |
| i = 1 | Dim 2 (SIN) | sin(pos / 100) |
| i = 1 | Dim 3 (COS) | cos(pos / 100) |

(In the interactive version, values are calculated live for your sentence; D = 4.)
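Without the live sentence, a small script can reproduce the table's values for the first few positions (assuming n = 10,000, which yields the denominators 1 and 100 shown above):

```python
import math

# D=4, n=10000: denominators n^(2i/D) are 1 (dims 0-1) and 100 (dims 2-3).
for pos in range(4):
    row = [math.sin(pos / 1), math.cos(pos / 1),
           math.sin(pos / 100), math.cos(pos / 100)]
    print(pos, [round(v, 3) for v in row])
```

Note how dims 0–1 swing through their full range while dims 2–3 barely move over these first positions; the slow pair only becomes informative over long distances.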

Sinusoidal Functions Across Sequence

The Mathematical Function

While the diagrams above use small 6-dimensional vectors for simplicity, real models use a d_model of 512 or higher. How do we generate a unique vector for every position without the numbers exploding?

The heatmap reveals the trick: a mix of sine and cosine waves of varying frequencies.

  • Top Rows (Fast changes): Lower dimensions change colors rapidly as you move across the sequence, acting like the "seconds" hand on a clock to track local, small changes in position.
  • Bottom Rows (Slow changes): Higher dimensions change colors slowly, acting like the "hours" hand to track macro, long-distance positions.

Takeaway: This mix of sinusoids gives every position in the sequence a distinct vector, and its structure lets the model recover the relative distance between any two words.
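The "distance between any two words" claim rests on a neat property of the sinusoidal scheme: PE(pos + k) is a fixed rotation of PE(pos), where the rotation angle for each sin/cos pair depends only on the offset k, not on pos. A sketch with D = 4 (n = 10,000):

```python
import numpy as np

n, d = 10000, 4

def pe(pos):
    """Sinusoidal encoding for one position (D=4)."""
    i = np.arange(d // 2)
    ang = pos / n ** (2 * i / d)
    out = np.empty(d)
    out[0::2], out[1::2] = np.sin(ang), np.cos(ang)
    return out

def rotate(vec, k):
    """Apply the fixed per-pair rotation that shifts a PE vector by offset k."""
    i = np.arange(d // 2)
    theta = k / n ** (2 * i / d)            # rotation angle: depends only on k
    s, c = vec[0::2], vec[1::2]
    out = np.empty(d)
    out[0::2] = s * np.cos(theta) + c * np.sin(theta)   # sin(a + theta)
    out[1::2] = c * np.cos(theta) - s * np.sin(theta)   # cos(a + theta)
    return out

# The same rotation maps PE(pos) to PE(pos + 3) for every pos.
for pos in (0, 7, 42):
    print(np.allclose(rotate(pe(pos), 3), pe(pos + 3)))  # True each time
```

Because the mapping is linear and position-independent, attention can learn to look "k tokens away" as a single transformation.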