Sparse & Local Attention Patterns

Compare full O(n²) attention with efficient alternatives — local windowed and sparse random attention. Hover over any row to see which tokens a query attends to.

Sequence Length n = 16

Number of tokens in the sequence

Local Window w = ±2

Tokens attended on each side (local attention)

Sparse k = 3

Random tokens each query attends to (sparse)

Full Attention

O(n²) = 256

Every token attends to every other token

Local Attention

O(n·w)

Each token attends only to its w nearest neighbours on each side

Sparse Attention

O(n·k)

Each token attends to k randomly-chosen tokens

Attention Arc Diagram — hover a matrix row to inspect a token

Click a row to lock the selection

Connection Count Comparison (live, updates with sliders)

Full Attention 256 connections

Local Attention 80 connections

Sparse Attention 48 connections
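The three counts above follow from the default slider values (n = 16, w = 2, k = 3). A short worked calculation, assuming the local counter includes each token's own position and does not clip windows at the sequence edges (an assumption inferred from the displayed value of 80):

```latex
\begin{align*}
\text{Full:}   \quad & n^2 = 16^2 = 256 \\
\text{Local:}  \quad & n\,(2w + 1) = 16 \times 5 = 80 \\
\text{Sparse:} \quad & n\,k = 16 \times 3 = 48
\end{align*}
```

If windows were clipped at the sequence boundaries, the local count would be slightly lower: 74 for these settings, because the outermost token at each end loses 2 neighbours and the next one in loses 1.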

🌐 Full Attention

Every token can see every other token. Complexity is O(n²) in both time and memory, making it prohibitive for long sequences (e.g., 10 000+ tokens). This is the standard Transformer attention from "Attention Is All You Need".
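To make the quadratic growth concrete, here is a minimal sketch that counts the query-key scores full attention needs and estimates the memory to hold them for a single head. The function name `full_attention_cost` and the 4-bytes-per-score default (32-bit floats) are illustrative assumptions, not part of the demo.

```python
def full_attention_cost(n: int, bytes_per_score: int = 4) -> tuple[int, int]:
    """Return (number of query-key scores, bytes to store them) for one head.

    bytes_per_score=4 assumes 32-bit floats; this is an illustrative choice.
    """
    scores = n * n  # every query is scored against every key
    return scores, scores * bytes_per_score


for n in (16, 1_000, 10_000):
    scores, nbytes = full_attention_cost(n)
    print(f"n={n:>6}: {scores:>11,} scores, {nbytes / 1e6:,.1f} MB")
```

Under these assumptions, n = 10 000 gives 100 000 000 scores, or about 400 MB for one head in one layer, before counting any other activations.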

🪟 Local (Windowed) Attention

Each token attends only to a fixed window of ±w neighbours, so complexity drops to O(n·w). This suits tasks where local context is enough, such as speech, time series, or sliding-window language models. Longformer uses this pattern, combined with a small number of global tokens.
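The ±w window can be sketched as a boolean mask in which query i may attend to key j whenever |i - j| ≤ w. The helper names below (`local_attention_mask`, `count_connections`) are hypothetical. The dense n × n mask is used only for illustration: it still costs O(n²) memory, so an efficient implementation would compute just the banded entries.

```python
def local_attention_mask(n: int, w: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j (|i - j| <= w)."""
    return [[abs(i - j) <= w for j in range(n)] for i in range(n)]


def count_connections(mask: list[list[bool]]) -> int:
    """Count the allowed (query, key) pairs in a mask."""
    return sum(sum(row) for row in mask)


n, w = 16, 2
mask = local_attention_mask(n, w)
attended = [j for j, allowed in enumerate(mask[8]) if allowed]
print(attended)                  # [6, 7, 8, 9, 10]
print(count_connections(mask))   # 74, because edge windows are clipped
print(n * (2 * w + 1))           # 80, the unclipped upper bound
```

The clipped count (74) is slightly below n·(2w + 1) = 80 because tokens near either end of the sequence have fewer neighbours available.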

✨ Sparse Attention

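Each token attends to k tokens chosen at random from the full sequence. Complexity becomes O(n·k) where k ≪ n. Useful for long-range tasks like document summarisation or genomic data. BigBird uses random attention of this kind alongside windowed and global attention. Sparse Transformers and Reformer are related approaches that select the sparse connections with fixed strided patterns and locality-sensitive hashing, respectively, rather than at random.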
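A minimal sketch of the random pattern, assuming each query draws k distinct key positions uniformly from the whole sequence, possibly including its own position. How the demo actually samples is not stated, so the function name `sparse_attention_indices`, the `seed` parameter, and the sampling scheme are all assumptions.

```python
import random


def sparse_attention_indices(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """For each query, pick k distinct key positions uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the pattern is reproducible
    return [sorted(rng.sample(range(n), k)) for _ in range(n)]


n, k = 16, 3
pattern = sparse_attention_indices(n, k)
print(pattern[0])                         # the k key positions for query 0
print(sum(len(row) for row in pattern))   # 48 == n * k
```

Storing only the chosen indices, rather than a dense n × n mask, keeps memory at O(n·k), which is the point of the pattern.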