RoPE makes attention care about distance, not just index.
Rotary Position Embedding rotates each 2D slice of query and key vectors by a position-dependent angle. That keeps content intact while turning absolute position into a relative phase difference inside the attention score.
1. Query and key channels are grouped into 2D pairs.
2. Each pair is rotated by an angle that grows with token position.
3. When attention takes a dot product, those rotations collapse into a term that depends only on the relative offset between positions (see the code sketch after this list).
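The three steps fit in a few lines. Here is a minimal sketch in NumPy, assuming the common base-10000 frequency schedule; the function name `rope` and the dimensions are illustrative, not taken from any particular library:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each 2D pair (x[2i], x[2i+1]) of `x` by the angle pos * theta_i."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair (assumed schedule)
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]            # step 1: group channels into 2D pairs
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin      # step 2: rotate each pair by pos * theta_i
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Step 3: after rotation, the dot product depends only on the offset n - m.
same_offset_a = rope(q, pos=3) @ rope(k, pos=7)      # positions 3 and 7, offset 4
same_offset_b = rope(q, pos=100) @ rope(k, pos=104)  # positions 100 and 104, offset 4
assert np.allclose(same_offset_a, same_offset_b)
```

The assertion is the whole point: shifting both positions by the same amount leaves the score unchanged, because only the offset survives the rotations.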
Zooming in on one 2D slice of the embedding shows how RoPE preserves content and exposes distance.
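Concretely, the per-pair rotation takes the standard form

\[
\begin{pmatrix} x'_{2i} \\ x'_{2i+1} \end{pmatrix}
=
\begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix}
\begin{pmatrix} x_{2i} \\ x_{2i+1} \end{pmatrix}.
\]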
Read this as: take the original vector \(x\), look only at coordinates \(x_{2i}\) and \(x_{2i+1}\), and rotate that 2D pair by angle \(m\theta_i\). Here \(m\) is the token position and \(i\) chooses which frequency band you are in.
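Writing that rotation as \(R(\phi)\) and restricting \(q\) and \(k\) to pair \(i\), the rotation-composition identity \(R(m\theta_i)^\top R(n\theta_i) = R\big((n-m)\theta_i\big)\) turns the dot product of a query rotated at position \(m\) with a key rotated at position \(n\) into

\[
\big(R(m\theta_i)\,q\big)^\top \big(R(n\theta_i)\,k\big)
= q^\top R\big((n-m)\theta_i\big)\,k
= \big(q_{2i}k_{2i} + q_{2i+1}k_{2i+1}\big)\cos\big((n-m)\theta_i\big)
+ \big(q_{2i+1}k_{2i} - q_{2i}k_{2i+1}\big)\sin\big((n-m)\theta_i\big).
\]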
This is the single-query, single-key score for one 2D pair. Lowercase \(q, k\) denote individual vectors; uppercase \(QK^\top\) denotes the full attention-score matrix. The positions do not disappear; they collapse into the relative offset term \((n-m)\theta_i\).
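For contrast, consider the additive alternative, whose score expands as

\[
(x_m + p_m)^\top (x_n + p_n)
= x_m^\top x_n + x_m^\top p_n + p_m^\top x_n + p_m^\top p_n.
\]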
Here \(p_m\) and \(p_n\) are positional vectors added directly onto token content. Content and position get mixed through extra cross terms, so the signal is less clean than RoPE's relative phase relation.
The sine term is the phase-sensitive part: it tracks the oriented relationship between the two coordinates in pair \(i\), so it is part of the real positional signal, not just a negligible correction. Different frequency bands let the model track both short-range and long-range offsets, which is one reason RoPE tends to extrapolate better than learned absolute embeddings.
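To give a rough sense of scale for those bands, here is a small sketch assuming the widely used base-10000 schedule \(\theta_i = 10000^{-2i/d}\); the base and the head dimension are illustrative assumptions:

```python
import numpy as np

head_dim = 64                                   # illustrative head dimension
i = np.arange(head_dim // 2)
theta = 10000.0 ** (-2 * i / head_dim)          # assumed base-10000 schedule
wavelengths = 2 * np.pi / theta                 # offset needed for one full rotation of pair i

# Low-i pairs spin fast and resolve nearby tokens; high-i pairs spin slowly
# and still distinguish offsets thousands of tokens apart.
print(np.round(wavelengths[:3], 1))    # roughly [6.3, 8.4, 11.2]
print(np.round(wavelengths[-3:]))      # roughly [26500, 35300, 47100]
```

The geometric spacing is what lets a single head mix very short wavelengths (adjacent tokens) with very long ones (whole-document offsets) in the same score.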