Transformer visual Positional encoding Interactive

Positional Encoding, under a microscope 🔬

This interactive canvas shows how positional encoding works in transformers. Self-attention alone does not know token order, so each token embedding is combined with a position vector built from sine and cosine patterns. The result carries both word meaning and sequence order.

1) Build your sentence

Use a repeated word like “cat” or “the” to see why the same word needs different position vectors.

Note: token embeddings in this demo are simulated with a deterministic function so the visualization is stable. In real transformer models, token embeddings are learned during training.

2) Embedding size D

Even dimensions use sine. Odd dimensions use cosine.

D = 8

Bigger D means more coordinate slots to store position patterns.

3) Focus dimension

Watch one coordinate wiggle across positions.

The chosen dimension is currently d0.

Step-by-step

This is the exact add operation the lecture highlights.

Word meaning vector comes from the token embedding.

Position vector comes from the sinusoidal formula at this position.

Final input to the transformer is their elementwise sum.

Selected token

Formula used at each coordinate

PE(pos, 2i)=sin(pos / 10000^(2i/D)), PE(pos, 2i+1)=cos(pos / 10000^(2i/D))

＋ add position vector

→ result sent into attention

Coordinate-by-coordinate addition

Each slot in the vector gets its own tiny addition.

Dim	Token emb	PE	Sum	Interpretation

Why the same word needs different encodings

Without position, repeated words are twins wearing identical jerseys. With position, each one gets a different coordinate fingerprint.

Two copies, two different final vectors

The word vector is the same. Only the positional vector changes.

Difference vector

Notice how the word meaning stayed the same, but the final vector changed because the position channels changed.

Positional encoding as a matrix

Each row is a token position. Each column is a vector dimension. No single scalar position value is added. A whole vector is added.

Pick another position to compare

Great for watching how nearby positions differ only a bit, while farther positions drift more.

Compare current selected position with

~ Dimension 0 uses sin

What this wave is telling you

The model does not use one giant position number. It uses many coordinated wave channels.

• Even dimensions are sine, odd dimensions are cosine.

• Different dimensions change at different rates, so position 3 and position 4 look similar but not identical.

• Combining many such channels makes each position act like a small frequency fingerprint.

• That is why transformers can tell order apart even though self-attention itself does not naturally know sequence order.

Quick intuition panel

Why not a simple monotonic or single periodic function?

Only word embedding

The model knows what the token is, but not where it is. Order becomes foggy.

One monotonic position value

Values can grow too large and become numerically awkward. One scalar is also too skinny to mix richly with a D-dimensional embedding.

Many sinusoidal channels

Stable values, multiple frequencies, smooth relative shifts, and a distinctive pattern for each position.