Transformer Geometry

RoPE makes attention care about distance, not just index.

Rotary Position Embedding rotates each 2D slice of query and key vectors by a position-dependent angle. That keeps content intact while turning absolute position into a relative phase difference inside the attention score.

What It Does
Rotates each pair of channels by position × frequency.
Why It Helps
The attention score ends up depending on the gap between tokens.
Cache Reality
0% smaller KV cache in normal decoding. The gain is reuse flexibility, not smaller memory.
Three-Sentence Summary
  1. Query and key channels are grouped into 2D pairs.
  2. Each pair is rotated by an angle that grows with token position.
  3. When attention takes a dot product, those rotations collapse into a term based on relative offset.
Mental model: RoPE does not attach a position label to a token. It tilts the query and key so their alignment reveals how far apart the tokens are.
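A minimal numpy sketch of those three steps. The frequency schedule \(\theta_i = 10000^{-2i/d}\) is the common choice from the RoPE paper, and apply_rope is an illustrative name, not a library function:

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate each 2D pair (x[2i], x[2i+1]) of x by pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-2.0 * np.arange(d // 2) / d)  # one frequency per pair
    angle = pos * theta
    x1, x2 = x[0::2], x[1::2]        # step 1: group channels into 2D pairs
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(angle) - x2 * np.sin(angle)  # step 2: rotate
    out[1::2] = x1 * np.sin(angle) + x2 * np.cos(angle)  #         each pair
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Step 3: the dot product depends only on the gap n - m, here 4.
print(apply_rope(q, 3) @ apply_rope(k, 7))    # positions (3, 7)
print(apply_rope(q, 13) @ apply_rope(k, 17))  # positions (13, 17): same score
```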
Geometry View

One 2D slice of the embedding

The dashed vectors are the original content directions. The solid vectors show the same content after position-dependent rotation.
Legend: base content vectors (dashed), rotated query and rotated key (solid), and the relative angle between them.
What Changed
Q rotated by 2.000 rad and K by 6.000 rad, for a relative angle of 4.000 rad.
What Stayed Invariant
If you shift both tokens together, the relative phase term stays the same.
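A quick numeric check of that invariance on a single 2D pair; the frequency theta = 0.1, the vectors, and the shift are illustrative values:

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2D vector by angle radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 0.1
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
m, n, shift = 3, 7, 500

score = rotate(q, m * theta) @ rotate(k, n * theta)
moved = rotate(q, (m + shift) * theta) @ rotate(k, (n + shift) * theta)
print(np.isclose(score, moved))  # True: only the gap n - m matters
```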
Reader Shortcut
RoPE turns “where is this token?” into “how far apart are these two tokens?”
Why The Math Works

RoPE preserves content and exposes distance

Rotation Formula
\[ \operatorname{RoPE}(x, m, i) = \begin{pmatrix} \cos(m\theta_i) & -\sin(m\theta_i) \\ \sin(m\theta_i) & \cos(m\theta_i) \end{pmatrix} \begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix} \]

Read this as: take the original vector \(x\), look only at coordinates \(x_{2i}\) and \(x_{2i+1}\), and rotate that 2D pair by angle \(m\theta_i\). Here \(m\) is the token position and \(i\) chooses which frequency band you are in.
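A direct transcription of the rotation formula, again assuming the common \(\theta_i = 10000^{-2i/d}\) schedule; rope_pair is a hypothetical helper that handles one slice at a time:

```python
import numpy as np

def rope_pair(x, m, i, d, base=10000.0):
    """Rotate the 2D slice (x[2i], x[2i+1]) of x by angle m * theta_i."""
    theta_i = base ** (-2.0 * i / d)  # assumed frequency schedule
    a = m * theta_i
    R = np.array([[np.cos(a), -np.sin(a)],
                  [np.sin(a),  np.cos(a)]])
    return R @ x[2 * i : 2 * i + 2]

x = np.array([1.0, 2.0, 3.0, 4.0])
print(rope_pair(x, m=5, i=0, d=4))  # pair (x_0, x_1), fastest frequency
print(rope_pair(x, m=5, i=1, d=4))  # pair (x_2, x_3), slower frequency
```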

Dot Product Property
\[ \langle \operatorname{RoPE}(q,m,i), \operatorname{RoPE}(k,n,i) \rangle = q_i^\top R_{(n-m)\theta_i} k_i \]
\( \langle a, b \rangle = a^\top b \)

This is the single-query, single-key score for one 2D pair. Lowercase \(q, k\) denote individual vectors; uppercase \(QK^\top\) denotes the full attention-score matrix. The positions do not disappear; they collapse into the relative phase term \((n-m)\theta_i\).
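A numeric check of the property; rot builds the 2×2 rotation matrix from the formula above, and the dimension, slice index, positions, and random vectors are all illustrative:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

d, i, base = 8, 1, 10000.0
theta_i = base ** (-2.0 * i / d)
rng = np.random.default_rng(1)
q, k = rng.normal(size=d), rng.normal(size=d)
q_i, k_i = q[2 * i : 2 * i + 2], k[2 * i : 2 * i + 2]

m, n = 4, 9
lhs = (rot(m * theta_i) @ q_i) @ (rot(n * theta_i) @ k_i)
rhs = q_i @ (rot((n - m) * theta_i) @ k_i)
print(np.isclose(lhs, rhs))  # True: positions collapse into n - m
```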

Symbols In This Section
\(q, k\) are the unrotated query and key vectors.
\(q_i, k_i\) are their \(i\)th 2D slices.
\(m, n\) are the query and key positions.
\( R_{(n-m)\theta_i} \) is the rotation matrix for the relative phase gap.
Additive Positional Encoding
\[ (q+p_m)^\top (k+p_n) = q^\top k + q^\top p_n + p_m^\top k + p_m^\top p_n \]

Here \(p_m\) and \(p_n\) are positional vectors added directly onto token content. Content and position get mixed through extra cross terms, so the signal is less clean than RoPE's relative phase relation.
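A small check that those four terms really are the whole score; the vectors here are random stand-ins for content and position:

```python
import numpy as np

rng = np.random.default_rng(2)
q, k = rng.normal(size=4), rng.normal(size=4)
p_m, p_n = rng.normal(size=4), rng.normal(size=4)  # stand-in position vectors

full = (q + p_m) @ (k + p_n)
terms = q @ k + q @ p_n + p_m @ k + p_m @ p_n  # content term + cross terms
print(np.isclose(full, terms))  # True: position leaks into every cross term
```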

RoPE Attention Signal
\[ \text{score}_i = (q_i^\top k_i)\,\cos((n-m)\theta_i) + (q_{i,2}k_{i,1} - q_{i,1}k_{i,2})\,\sin((n-m)\theta_i) \]

The sine term is the phase-sensitive part: it tracks the oriented relationship between the two coordinates in pair \(i\), so it is part of the real positional signal, not just a negligible correction. Different frequency bands let the model track both short-range and long-range offsets, which is one reason RoPE tends to extrapolate better than learned absolute embeddings.
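A check of this decomposition on one pair, with the cross term written as \(q_{i,2}k_{i,1} - q_{i,1}k_{i,2}\) to match the rotation convention above; theta and the positions are illustrative:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = 0.05                       # illustrative frequency for this pair
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])
m, n = 3, 20
phi = (n - m) * theta              # relative phase

score = (rot(m * theta) @ q) @ (rot(n * theta) @ k)
decomposed = (q @ k) * np.cos(phi) + (q[1] * k[0] - q[0] * k[1]) * np.sin(phi)
print(np.isclose(score, decomposed))  # True: cos carries content, sin phase
```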

Caching Impact

RoPE helps cache reuse more than cache size

Normal Decoding
0%
KV cache memory reduction from RoPE alone.
Shifted Prefix
up to 100%
Potential recompute saved for a moved cached block.
Saved Work
2048
Token positions you do not need to fully recompute in that reuse scenario.
Scenario: a 2048-token cached prefix shifted forward by +256 positions.
Absolute / Additive PE
“Absolute / Additive PE” means you add a position vector such as \(p_m\) directly into each token representation. Move the same prefix to a new location and the cached vectors still contain the old absolute position, so they no longer match the new block location. You usually have to recompute the whole block.
Recompute: 2048 token positions
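A sketch of that mismatch, assuming a generic sinusoidal position vector p(m); the exact formula varies by model, but any position vector baked into the cached representation causes the same failure:

```python
import numpy as np

def p(m, d=4):
    """Stand-in sinusoidal positional vector (illustrative, not exact)."""
    return np.sin(m / 10000.0 ** (np.arange(d) / d))

k = np.array([0.3, -0.7, 1.1, 0.2])  # token content
cached = k + p(100)                   # key cached at old position 100
needed = k + p(100 + 256)             # same token after a +256 shift

print(np.allclose(cached, needed))    # False: the whole block must be redone
```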
RoPE-Aware Prefix Reuse
RoPE keeps positional information as a phase relationship, so systems can sometimes reuse a cached prefix by applying the new offset instead of recomputing all token representations.
Full-prefix recompute avoided: 2048 token positions
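A sketch of the rephasing idea on one 2D pair: because rotations compose, rotating the cached key by the extra offset reproduces the key at its new position. The frequency and positions are illustrative, and a real system would do this once per frequency band:

```python
import numpy as np

def rot(a):
    """2x2 rotation matrix for angle a."""
    return np.array([[np.cos(a), -np.sin(a)],
                     [np.sin(a),  np.cos(a)]])

theta = 0.02
k = np.array([0.8, -0.4])
m, delta = 1000, 256

cached = rot(m * theta) @ k             # key as cached at position m
rephased = rot(delta * theta) @ cached  # apply only the new offset
fresh = rot((m + delta) * theta) @ k    # full recompute at m + delta

print(np.allclose(rephased, fresh))     # True: rephase instead of recompute
```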
Important nuance: vanilla RoPE does not make cached keys and values “position-free,” and it does not shrink the standard autoregressive KV cache. The practical benefit is that relative-position structure makes prefix reuse and sequence shifting more tractable than with absolute additive encodings. In the shifted-prefix scenario above, the full recompute saved can approach 100% of that moved block, but that is a reuse advantage, not a per-token memory reduction.