Transformer visual Positional encoding Interactive

Positional Encoding, under a microscope 🔬

This interactive canvas shows how positional encoding works in transformers. Self-attention alone does not know token order, so each token embedding is combined with a position vector built from sine and cosine patterns. The result carries both word meaning and sequence order.

1) Build your sentence
Use a repeated word like “cat” or “the” to see why the same word needs different position vectors.
Note: token embeddings in this demo are simulated with a deterministic function so the visualization is stable. In real transformer models, token embeddings are learned during training.
2) Embedding size D
Even dimensions use sine. Odd dimensions use cosine.
D = 8
Bigger D means more coordinate slots to store position patterns.
3) Focus dimension
Watch one coordinate wiggle across positions.
The chosen dimension is currently d0.
Step-by-step
This is the exact add operation the lecture highlights.
Word meaning vector comes from the token embedding.
Position vector comes from the sinusoidal formula at this position.
Final input to the transformer is their elementwise sum.
T
Selected token
ƒ
Formula used at each coordinate
PE(pos, 2i)=sin(pos / 10000^(2i/D)), PE(pos, 2i+1)=cos(pos / 10000^(2i/D))
+ add position vector
→ result sent into attention
Coordinate-by-coordinate addition
Each slot in the vector gets its own tiny addition.
Dim Token emb PE Sum Interpretation
Why the same word needs different encodings
Without position, repeated words are twins wearing identical jerseys. With position, each one gets a different coordinate fingerprint.
Two copies, two different final vectors
The word vector is the same. Only the positional vector changes.
Difference vector
Notice how the word meaning stayed the same, but the final vector changed because the position channels changed.
Positional encoding as a matrix
Each row is a token position. Each column is a vector dimension. No single scalar position value is added. A whole vector is added.
Pick another position to compare
Great for watching how nearby positions differ only a bit, while farther positions drift more.
Compare current selected position with
~ Dimension 0 uses sin
What this wave is telling you
The model does not use one giant position number. It uses many coordinated wave channels.
• Even dimensions are sine, odd dimensions are cosine.
• Different dimensions change at different rates, so position 3 and position 4 look similar but not identical.
• Combining many such channels makes each position act like a small frequency fingerprint.
• That is why transformers can tell order apart even though self-attention itself does not naturally know sequence order.
Quick intuition panel
Why not a simple monotonic or single periodic function?
Only word embedding
The model knows what the token is, but not where it is. Order becomes foggy.
One monotonic position value
Values can grow too large and become numerically awkward. One scalar is also too skinny to mix richly with a D-dimensional embedding.
Many sinusoidal channels
Stable values, multiple frequencies, smooth relative shifts, and a distinctive pattern for each position.