Here, pos is the token's position in the sequence, and i is the dimension index in the encoding space. Two arguments are needed because the positional encoding is a vector, not a scalar: PE(pos, i) is the i-th component of the vector assigned to position pos.
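As a minimal sketch (assuming the standard sinusoidal formulation with base 10000, where even components use sine and odd components use cosine), PE(pos, i) can be computed like this:

```python
import math

def pe(pos, i, d_model):
    """i-th component of the encoding vector for position pos.

    Even indices use sine, odd indices use cosine; the pair (2k, 2k+1)
    shares the frequency 1 / 10000 ** (2k / d_model).
    """
    angle = pos / 10000 ** ((i // 2) * 2 / d_model)
    return math.sin(angle) if i % 2 == 0 else math.cos(angle)
```

For example, position 0 encodes as alternating 0s and 1s, since sin(0) = 0 and cos(0) = 1.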
## The Mathematical Formula
If the wavelengths are too short, periodicity shows up early: even the highest dimensions complete full cycles within a short span, so distant positions end up with very similar encodings and confuse the model.

If the wavelengths are too long, the differences are too small: the waves for the higher dimensions stretch out so much that their values barely change across the sequence, providing very little positional information.
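Both failure modes can be seen numerically. This sketch uses illustrative bases (2 as "too small", 1e12 as "too large") rather than values any real model uses, and looks only at the highest sin/cos pair of a toy 6-dimensional encoding:

```python
import math

D_MODEL, K = 6, 2  # highest sin/cos pair of a toy 6-dim encoding

def pe_pair(pos, base):
    """(sin, cos) values of pair K for a given frequency base."""
    freq = 1 / base ** (2 * K / D_MODEL)
    return (math.sin(pos * freq), math.cos(pos * freq))

# Base too small: the wave repeats roughly every 10 positions,
# so positions 0 and 10 get nearly identical values.
near, far = pe_pair(0, 2), pe_pair(10, 2)

# Base too large: the wave is so stretched that positions 0 and 100
# are indistinguishable to many decimal places.
a, b = pe_pair(0, 1e12), pe_pair(100, 1e12)
```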
| Pos | Word | Dim 0 (pair i=0): sin(pos / 1) | Dim 1 (pair i=0): cos(pos / 1) | Dim 2 (pair i=1): sin(pos / 100) | Dim 3 (pair i=1): cos(pos / 100) |
|---|---|---|---|---|---|
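The four encoding columns above can be reproduced directly. This is a sketch assuming base 10000 and d_model = 4, which yields the divisors 1 (for pair i=0) and 100 (for pair i=1):

```python
import math

def encode(pos, d_model=4, base=10000):
    """Build the full encoding vector by interleaving sin/cos pairs."""
    vec = []
    for k in range(d_model // 2):          # pair index i = 0, 1, ...
        angle = pos / base ** (2 * k / d_model)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

# Position 0 always encodes as [0.0, 1.0, 0.0, 1.0].
```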
While the diagrams above use small 6-dimensional vectors for simplicity, real models use a d_model of 512 or higher. How do we generate a unique vector for every position without the numbers exploding?
The heatmap reveals the trick: a mix of sine and cosine waves of varying frequencies.
Takeaway: this wavy mathematical function gives every position in a realistic sequence its own distinct vector, and because a fixed offset rotates each sine/cosine pair in a predictable way, the model can infer the relative distance between any two words.
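Both halves of that claim can be checked directly. This sketch uses d_model = 8 and base 10000; the 1000-position range and the pair chosen for the rotation check are illustrative:

```python
import math

def encode(pos, d_model=8, base=10000):
    """Sinusoidal encoding: pair k uses frequency 1 / base**(2k / d_model)."""
    vec = []
    for k in range(d_model // 2):
        angle = pos / base ** (2 * k / d_model)
        vec += [math.sin(angle), math.cos(angle)]
    return vec

# Distinctness: the first 1000 positions all map to different vectors.
codes = {tuple(encode(p)) for p in range(1000)}

# Relative offsets: shifting by k rotates each (sin, cos) pair by k * freq,
# so PE(pos + k) is a fixed linear function of PE(pos).
f = 1.0  # frequency of the fastest pair
p, k = 5, 3
s, c = math.sin(p * f), math.cos(p * f)
shifted_sin = s * math.cos(k * f) + c * math.sin(k * f)  # equals sin((p + k) * f)
```

The rotation identity is just the angle-addition formula sin(a + b) = sin(a)cos(b) + cos(a)sin(b), applied pair by pair.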