Transformer Geometry

RoPE makes attention care about distance, not just index.

Rotary Position Embedding rotates each 2D slice of query and key vectors by a position-dependent angle. That keeps content intact while turning absolute position into a relative phase difference inside the attention score.

What It Does
Rotates each pair of channels by position × frequency.
Why It Helps
The attention score ends up depending on the gap between tokens.
Cache Reality
0% smaller KV cache in normal decoding. The gain is reuse flexibility, not smaller memory.
Three-Sentence Summary
  1. Query and key channels are grouped into 2D pairs.
  2. Each pair is rotated by an angle that grows with token position.
  3. When attention takes a dot product, those rotations collapse into a term based on relative offset.
Mental model: RoPE does not attach a position label to a token. It tilts the query and key so their alignment reveals how far apart the tokens are.
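A minimal numpy sketch of those three steps. The frequency schedule \(\theta_i = 10000^{-2i/d}\) is the common choice from the RoPE paper, and apply_rope is an illustrative name, not a library function:

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate each 2D pair (x[2i], x[2i+1]) of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per pair
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]        # step 1: group channels into 2D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)  # step 2: rotate
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)  #         each pair
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Step 3: the dot product depends only on the gap n - m, here 4.
print(apply_rope(q, 3) @ apply_rope(k, 7))    # positions (3, 7)
print(apply_rope(q, 13) @ apply_rope(k, 17))  # positions (13, 17): same score
```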
Geometry View

One 2D slice of the embedding

The dashed vectors are the original content directions. The solid vectors show the same content after position-dependent rotation.
Legend: base content vectors (dashed), rotated query and rotated key (solid), and the relative angle between them.
What Changed
Q rotated by 2.000 rad and K by 6.000 rad, for a relative angle of 4.000 rad.
What Stayed Invariant
If you shift both tokens together, the relative phase term stays the same.
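A quick numeric check of that invariance on a single 2D pair; the frequency theta = 0.1, the vectors, and the shift are illustrative values:

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2D vector by angle radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
m, n, shift = 3, 7, 500

score = rotate(q, m * theta) @ rotate(k, n * theta)
moved = rotate(q, (m + shift) * theta) @ rotate(k, (n + shift) * theta)
print(np.isclose(score, moved))  # True: only the gap n - m matters
```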
Reader Shortcut
RoPE turns “where is this token?” into “how far apart are these two tokens?”
Why The Math Works

RoPE preserves content and exposes distance

Rotation Formula
\[ \operatorname{RoPE}(x, m, i) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix} \]

Read this as: take the original vector \(x\), look only at coordinates \(x_{2i}\) and \(x_{2i+1}\), and rotate that 2D pair by angle \(m\theta_i\). Here \(m\) is the token position and \(i\) chooses which frequency band you are in.
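A direct transcription of the rotation formula, again assuming the common \(\theta_i = 10000^{-2i/d}\) schedule; rope_pair is a hypothetical helper that handles one slice at a time:

```python
import numpy as np

def rope_pair(x, m, i, d, base=10000.0):
    """Rotate the 2D slice (x[2i], x[2i+1]) of x by angle m * theta_i."""
    theta_i = base ** (-2.0 * i / d)  # assumed frequency schedule
    a = m * theta_i
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ x[2 * i : 2 * i + 2]

x = np.array([1.0, 2.0, 3.0, 4.0])
print(rope_pair(x, m=5, i=0, d=4))  # pair (x_0, x_1), fastest frequency
print(rope_pair(x, m=5, i=1, d=4))  # pair (x_2, x_3), slower frequency
```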

Dot Product Property
\[ \langle \operatorname{RoPE}(q,m,i), \operatorname{RoPE}(k,n,i) \rangle = q_i^\top R_{(n-m)\theta_i} k_i \]
\( \langle a, b \rangle = a^\top b \)

This is the single-query, single-key score for one 2D pair. Lowercase \(q, k\) denote individual vectors; uppercase \(QK^\top\) denotes the full attention-score matrix. The positions do not disappear; they collapse into the relative phase term \((n-m)\theta_i\).
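A numeric check of the property; rot builds the 2×2 rotation matrix from the formula above, and the dimension, slice index, positions, and random vectors are all illustrative:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

d, i, base = 8, 1, 10000.0
theta_i = base ** (-2.0 * i / d)
rng = np.random.default_rng(1)
q, k = rng.normal(size=d), rng.normal(size=d)
q_i, k_i = q[2 * i : 2 * i + 2], k[2 * i : 2 * i + 2]

m, n = 4, 9
lhs = (rot(m * theta_i) @ q_i) @ (rot(n * theta_i) @ k_i)
rhs = q_i @ (rot((n - m) * theta_i) @ k_i)
print(np.isclose(lhs, rhs))  # True: positions collapse into n - m
```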

Symbols In This Section
\(q, k\) are the unrotated query and key vectors.
\(q_i, k_i\) are their \(i\)th 2D slices.
\(m, n\) are the query and key positions.
\( R_{(n-m)\theta_i} \) is the rotation matrix for the relative phase gap.
Additive Positional Encoding
\[ (q+p_m)^\top (k+p_n) = q^\top k + q^\top p_n + p_m^\top k + p_m^\top p_n \]

Here \(p_m\) and \(p_n\) are positional vectors added directly onto token content. Content and position get mixed through extra cross terms, so the signal is less clean than RoPE's relative phase relation.
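A small check that those four terms really are the whole score; the vectors here are random stand-ins for content and position:

```python
import numpy as np

rng = np.random.default_rng(2)
q, k = rng.normal(size=4), rng.normal(size=4)
p_m, p_n = rng.normal(size=4), rng.normal(size=4)  # stand-in position vectors

full = (q + p_m) @ (k + p_n)
terms = q @ k + q @ p_n + p_m @ k + p_m @ p_n  # content term + cross terms
print(np.isclose(full, terms))  # True: position leaks into every cross term
```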

RoPE Attention Signal
\[ \text{score}_i = (q_i^\top k_i)\,\cos((n-m)\theta_i) + (q_{i,2}k_{i,1} - q_{i,1}k_{i,2})\,\sin((n-m)\theta_i) \]

The sine term is the phase-sensitive part: it tracks the oriented relationship between the two coordinates in pair \(i\), so it is part of the real positional signal, not just a negligible correction. Different frequency bands let the model track both short-range and long-range offsets, which is one reason RoPE tends to extrapolate better than learned absolute embeddings.
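A check of this decomposition on one pair, with the cross term written as \(q_{i,2}k_{i,1} - q_{i,1}k_{i,2}\) to match the rotation convention above; theta and the positions are illustrative:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = 0.05                       # illustrative frequency for this pair
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
m, n = 3, 20
phi = (n - m) * theta              # relative phase

score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
decomposed = (q @ k) * np.cos(phi) + (q[1] * k[0] - q[0] * k[1]) * np.sin(phi)
print(np.isclose(score, decomposed))  # True: cos carries content, sin phase
```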

Caching Impact

RoPE helps cache reuse more than cache size

Normal Decoding
0%
KV cache memory reduction from RoPE alone.
Shifted Prefix
up to 100%
Potential recompute saved for a moved cached block.
Saved Work
2048
Token positions you do not need to fully recompute in that reuse scenario.
Scenario: a 2048-token cached prefix shifted forward by +256 positions.
Absolute / Additive PE
“Absolute / Additive PE” means you add a position vector such as \(p_m\) directly into each token representation. Move the same prefix to a new location and the cached vectors still contain the old absolute position, so they no longer match the new block location. You usually have to recompute the whole block.
Recompute: 2048 token positions
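A sketch of that mismatch, assuming a generic sinusoidal position vector p(m); the exact formula varies by model, but any position vector baked into the cached representation causes the same failure:

```python
import numpy as np

def p(m, d=4):
    """Stand-in sinusoidal positional vector (illustrative, not exact)."""
    return np.sin(m / 10000.0 ** (np.arange(d) / d))

k = np.array([0.3, -0.7, 1.1, 0.2])  # token content
cached = k + p(100)                   # key cached at old position 100
needed = k + p(100 + 256)             # same token after a +256 shift

print(np.allclose(cached, needed))    # False: the whole block must be redone
```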
RoPE-Aware Prefix Reuse
RoPE keeps positional information as a phase relationship, so systems can sometimes reuse a cached prefix by applying the new offset instead of recomputing all token representations.
Full-prefix recompute avoided: 2048 token positions
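A sketch of the rephasing idea on one 2D pair: because rotations compose, rotating the cached key by the extra offset reproduces the key at its new position. The frequency and positions are illustrative, and a real system would do this once per frequency band:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = 0.02
k = np.array([0.8, -0.4])
m, delta = 1000, 256

cached = rot(m * theta) @ k             # key as cached at position m
rephased = rot(delta * theta) @ cached  # apply only the new offset
fresh = rot((m + delta) * theta) @ k    # full recompute at m + delta

print(np.allclose(rephased, fresh))     # True: rephase instead of recompute
```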
Important nuance: vanilla RoPE does not make cached keys and values “position-free,” and it does not shrink the standard autoregressive KV cache. The practical benefit is that relative-position structure makes prefix reuse and sequence shifting more tractable than with absolute additive encodings. In the shifted-prefix scenario above, the full recompute saved can approach 100% of that moved block, but that is a reuse advantage, not a per-token memory reduction.