Sparse & Local Attention Patterns
Compare full O(n²) attention with efficient alternatives — local windowed and sparse random attention. Hover over any row to see which tokens a query attends to.
Number of tokens in the sequence
Tokens attended on each side (local attention)
Random tokens each query attends to (sparse)
Full Attention
O(n²) = 256
Every token attends to every other token
Local Attention
O(n·w)
Each token attends only to ±w neighbors
Sparse Attention
O(n·k)
Each token attends to k randomly-chosen tokens
Attention Arc Diagram — hover a matrix row to inspect a token
Click a row to lock the selection
Connections Count Comparison (live, updates with sliders)
🌐 Full Attention
Every token can see every other token. Complexity is O(n²) in both time and memory, making it prohibitive for long sequences (e.g., 10 000+ tokens). This is the standard Transformer attention from "Attention Is All You Need".
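As a rough illustration of the quadratic count, the sketch below builds a full attention mask in plain Python and counts the allowed query-key pairs. The function names `full_attention_mask` and `count_connections` are hypothetical helpers, not part of this demo's code, and n = 16 is assumed because it reproduces the 256 shown in the Full Attention panel.

```python
def full_attention_mask(n):
    """Return an n x n boolean mask in which every query attends to every key."""
    return [[True] * n for _ in range(n)]


def count_connections(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Assumed sequence length; 16 * 16 gives the 256 connections in the panel.
mask = full_attention_mask(16)
print(count_connections(mask))  # 256
```

Doubling n quadruples the count, which is why this pattern becomes expensive for long sequences.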
🪟 Local (Windowed) Attention
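Each token attends only to a fixed window of ±w neighbours (w on each side). Complexity drops to O(n·w), which is linear in n for a fixed w. Well suited to cases where local context is enough, such as speech, time series, or sliding-window language models. Used in Longformer, which combines it with a small number of global tokens.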
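The windowed pattern can be sketched as a banded mask. This is a minimal example under stated assumptions: the helper names are hypothetical, each token is assumed to attend to itself as well as its ±w neighbours, and windows are simply clipped at the sequence edges (the demo may handle edges differently).

```python
def local_attention_mask(n, w):
    """Query i may attend to key j only when |i - j| <= w (self included)."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]


def count_connections(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Assumed values: n = 16 tokens, w = 2 neighbours on each side.
mask = local_attention_mask(16, 2)
print(count_connections(mask))  # 74, below 16 * 5 = 80 because edge windows are clipped
```

For w < n the total is n(2w + 1) − w(w + 1), so it grows linearly in n for a fixed window.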
✨ Sparse Attention
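Each token attends to k tokens chosen from the full sequence (at random in this demo). Complexity becomes O(n·k) where k ≪ n. Useful for long-range tasks like document summarisation or genomic data. Related sparse patterns appear in BigBird (random, windowed, and global attention combined), Sparse Transformers (fixed strided patterns), and Reformer (locality-sensitive hashing).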
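A random sparse mask of this kind might be built as follows. This is only a sketch: `sparse_random_mask` is a hypothetical helper, the k keys per query are drawn uniformly without replacement, a token is not forced to attend to itself, and the seed is there purely for reproducibility. The demo's own sampling rule may differ.

```python
import random


def sparse_random_mask(n, k, seed=0):
    """Give each query exactly k distinct, uniformly sampled keys (requires k <= n)."""
    rng = random.Random(seed)  # local generator so results are repeatable
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in rng.sample(range(n), k):
            mask[i][j] = True
    return mask


def count_connections(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


# Assumed values: n = 16 tokens, k = 3 random keys per query.
mask = sparse_random_mask(16, 3)
print(count_connections(mask))  # 48, i.e. n * k
```

Because each row holds exactly k entries, the total is always n·k regardless of which keys are drawn.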