
Transformer Memory & Compute Bottlenecks

During training, every forward pass must store Q, K, V, and the full n×n attention matrix so that backpropagation can compute gradients. With standard attention, this memory grows as O(n²), quadratically in the sequence length n, while feedforward activation memory grows as O(n). This is why long-sequence training is expensive and why Flash Attention was invented.

Model Parameters

Q/K/V dim = d_model / h (d_k = 64 per head)

Activation Memory Breakdown (per layer, float32)

Each segment is a tensor stored during the forward pass. Red segments (⚠) are the n² bottleneck: their size grows quadratically with n. When n > 1024, they flash to signal the bottleneck.
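The breakdown can be approximated with a few lines of arithmetic. The sketch below is an illustration, not the page's own calculation: the list of stored tensors, the defaults (d_model = 512, h = 8, d_ff = 4 × d_model, batch size 1), and the use of 1 MB = 2^20 bytes are all assumptions.

```python
BYTES_FP32 = 4  # float32, as in the heading above


def activation_memory_mb(n, d_model=512, h=8, d_ff=None, batch=1):
    """Rough per-layer activation memory in MB (2**20 bytes), by tensor.

    The set of stored tensors is an assumption; a real framework may
    keep more (dropout masks, layer-norm statistics) or fewer.
    """
    d_ff = 4 * d_model if d_ff is None else d_ff
    elements = {
        "Q": n * d_model,
        "K": n * d_model,
        "V": n * d_model,
        "attention scores (QK^T)": h * n * n,  # quadratic in n
        "softmax weights": h * n * n,          # quadratic in n
        "attention output": n * d_model,
        "FFN hidden": n * d_ff,
    }
    return {name: batch * count * BYTES_FP32 / 2**20
            for name, count in elements.items()}


for n in (512, 1024, 2048):
    breakdown = activation_memory_mb(n)
    quadratic = breakdown["attention scores (QK^T)"] + breakdown["softmax weights"]
    total = sum(breakdown.values())
    print(f"n={n}: total {total:.1f} MB, n^2 tensors {quadratic:.1f} MB")
```

Only the two h × n × n entries depend on n²; every other entry is linear in n.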

Scaling Curves

How memory and compute grow as n increases. The purple dashed line marks the current n. Watch attention (red) overtake feedforward (blue) as n grows.

Chart A — Memory vs Sequence Length (MB / layer)

Chart B — Compute vs Sequence Length (GFLOPs / layer)
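A compute curve like Chart B can be estimated from matrix-multiply sizes. This is a rough sketch under stated assumptions: forward pass only, one multiply-add counted as 2 FLOPs, d_ff = 4 × d_model, and softmax, layer norm, and biases ignored. The function name and defaults are illustrative, not taken from the page.

```python
def layer_gflops(n, d_model=512, d_ff=None):
    """Approximate forward-pass GFLOPs for one Transformer layer."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    flops = {
        # QK^T and weights @ V, summed over heads: 2 matmuls of 2*n*n*d_model
        "attention (n^2 matmuls)": 2 * (2 * n * n * d_model),
        # W_Q, W_K, W_V, W_O projections: 4 matmuls of 2*n*d_model^2
        "projections": 4 * (2 * n * d_model * d_model),
        # two feedforward linear layers: 2 matmuls of 2*n*d_model*d_ff
        "feedforward": 2 * (2 * n * d_model * d_ff),
    }
    return {name: value / 1e9 for name, value in flops.items()}


for n in (512, 2048, 8192):
    g = layer_gflops(n)
    print(n, {name: round(value, 2) for name, value in g.items()})
```

Under these assumptions the quadratic attention term equals the feedforward term when n = d_ff, so with d_model = 512 the curves cross at n = 2048.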

What Gets Stored for Backprop?

Step through the forward pass to see each activation saved to memory.

Memory accumulated (single layer): 0.00 MB of total activation memory for this layer
Stage colours: Active (processing) · Storing activations · Done
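The same walk-through can be reproduced with real arrays. The NumPy sketch below runs a multi-head attention forward pass, keeps each tensor a backward pass would need in a dictionary, and prints a running total from the arrays' actual byte sizes. The function name, the choice of cached tensors, and the small sizes (n = 256, d_model = 128, h = 2) are assumptions for illustration.

```python
import numpy as np


def attention_forward_with_cache(x, w_q, w_k, w_v, h):
    """Multi-head attention forward pass that records saved activations."""
    n, d_model = x.shape
    d_k = d_model // h
    scale = d_k ** -0.5  # Python float, so float32 arrays stay float32

    def split_heads(t):
        return t.reshape(n, h, d_k).transpose(1, 0, 2)  # (h, n, d_k)

    cache = {}
    cache["Q"] = split_heads(x @ w_q)
    cache["K"] = split_heads(x @ w_k)
    cache["V"] = split_heads(x @ w_v)

    scores = cache["Q"] @ cache["K"].transpose(0, 2, 1) * scale  # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    cache["softmax weights"] = weights  # the n x n tensor, per head

    out = (weights @ cache["V"]).transpose(1, 0, 2).reshape(n, d_model)
    return out, cache


rng = np.random.default_rng(0)
n, d_model, h = 256, 128, 2
x = rng.standard_normal((n, d_model), dtype=np.float32)
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model), dtype=np.float32)
                 for _ in range(3))

out, cache = attention_forward_with_cache(x, w_q, w_k, w_v, h)

total = 0
for name, tensor in cache.items():
    total += tensor.nbytes
    print(f"{name:16s} {tensor.nbytes / 2**20:6.3f} MB   "
          f"running total {total / 2**20:6.3f} MB")
```

This version caches only the post-softmax weights; an implementation that also keeps the raw scores would hold a second h × n × n tensor.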

Key Insight: The n² Bottleneck

Attention score matrix: — MB per layer  ·  — MB across all layers
Double n → 4× memory for attention scores (quadratic: n² × 4 bytes per head in float32)
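As a quick worked example of the 4× rule, the helper below computes the score-matrix memory per layer and across all layers. The function name and the defaults (8 heads, 6 layers, batch size 1, 1 MB = 2^20 bytes) are assumptions chosen for illustration.

```python
def attention_scores_mb(n, heads=8, layers=6, batch=1, bytes_per_value=4):
    """Memory of the n x n score matrices: (per layer, across all layers)."""
    per_layer = batch * heads * n * n * bytes_per_value / 2**20
    return per_layer, per_layer * layers


previous = None
for n in (1024, 2048, 4096):
    per_layer, all_layers = attention_scores_mb(n)
    ratio = "" if previous is None else f"  ({per_layer / previous:.0f}x previous)"
    print(f"n={n}: {per_layer:.0f} MB per layer, "
          f"{all_layers:.0f} MB across all layers{ratio}")
    previous = per_layer
```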
Why Flash Attention?
It computes the n×n scores tile by tile in on-chip SRAM and recomputes them during the backward pass instead of storing them in HBM, trading extra FLOPs for dramatically less HBM usage.
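The core trick can be sketched in NumPy: process the keys and values one tile at a time and maintain a running softmax, so the full n×n matrix never exists at once. This is a simplified single-head, forward-only illustration, not the actual Flash Attention kernel, which also tiles the queries, runs as a fused GPU kernel, and recomputes the tiles in the backward pass. The function names and the tile size are assumptions.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference version: materialises the full n x n score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v


def tiled_attention(q, k, v, tile=64):
    """Same result, but only an n x tile block of scores exists at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[1]))
    row_max = np.full(n, -np.inf)  # running max of each query's scores
    row_sum = np.zeros(n)          # running softmax denominator

    for start in range(0, k.shape[0], tile):
        k_t = k[start:start + tile]
        v_t = v[start:start + tile]
        s = q @ k_t.T * scale                      # (n, tile) block only
        new_max = np.maximum(row_max, s.max(axis=1))
        correction = np.exp(row_max - new_max)     # rescales earlier tiles
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ v_t
        row_max = new_max

    return out / row_sum[:, None]


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((200, 32)) for _ in range(3))
diff = np.abs(naive_attention(q, k, v) - tiled_attention(q, k, v)).max()
print(f"max abs difference: {diff:.2e}")
```

Besides the output, the tiled version carries only the per-row maximum and sum between tiles, which is O(n) state rather than O(n²).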