Transformer Training Memory

Model Parameters

Sequence Length (n)

Tick marks: 64, 512, 4096

Model Dim (d_model)

d_k = 64 per head

Num Heads (h)

Q/K/V dim = d_model / h

Num Layers (L)

Tick marks: 1, 6, 24

Activation Memory Breakdown (per layer, float32)

Each segment is a tensor stored during the forward pass so that backprop can reuse it. Red segments (⚠) are the n² bottleneck: their size grows quadratically with sequence length. When n > 1024 they flash to signal the bottleneck.
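
The breakdown above can be approximated with a few lines of arithmetic. This is a minimal sketch, not the page's own accounting: the list of stored tensors, the feedforward width of 4 × d_model, batch size 1, and MB meaning 2^20 bytes are all assumptions, and the function name is hypothetical.

```python
def activation_bytes_per_layer(n, d_model, h, d_ff=None, bytes_per_el=4):
    """Rough per-layer activation sizes in bytes (illustrative accounting, batch size 1)."""
    d_ff = 4 * d_model if d_ff is None else d_ff  # assumed feedforward width
    linear = n * d_model * bytes_per_el           # one (n, d_model) tensor
    quadratic = h * n * n * bytes_per_el          # one (h, n, n) tensor
    return {
        "layer input": linear,
        "Q": linear,
        "K": linear,
        "V": linear,
        "attention scores (n^2)": quadratic,
        "softmax weights (n^2)": quadratic,
        "attention output": linear,
        "feedforward hidden": n * d_ff * bytes_per_el,
    }


sizes = activation_bytes_per_layer(n=512, d_model=512, h=8)
total = sum(sizes.values())
for name, nbytes in sizes.items():
    print(f"{name:24s} {nbytes / 2**20:8.2f} MB  {100 * nbytes / total:5.1f}%")
```

Only the two entries marked n^2 depend on the square of the sequence length; every other entry grows linearly with n.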

Scaling Curves

These charts show how memory and compute grow as n increases. The purple dashed line marks the current n. Watch attention (red) overtake feedforward (blue) as n grows.

Chart A — Memory vs Sequence Length (MB / layer)

Chart B — Compute vs Sequence Length (GFLOPs / layer)
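
The two curves can be reproduced approximately offline. The sketch below is an assumption-laden estimate rather than the formulas behind the charts: it counts only the attention score matrix for attention memory, only the hidden activation for feedforward memory, assumes a feedforward width of 4 × d_model, counts 2 FLOPs per multiply-add, and leaves the Q/K/V/output projections out of both curves. The function names are hypothetical.

```python
def memory_mb(n, d_model, h, d_ff=None, bytes_per_el=4):
    """(attention, feedforward) activation memory in MB for one layer, batch size 1."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attn = h * n * n * bytes_per_el   # one n x n score matrix per head
    ffn = n * d_ff * bytes_per_el     # hidden activation of the feedforward block
    return attn / 2**20, ffn / 2**20


def gflops(n, d_model, d_ff=None):
    """(attention, feedforward) forward-pass GFLOPs for one layer, batch size 1."""
    d_ff = 4 * d_model if d_ff is None else d_ff
    attn = 4 * n * n * d_model        # Q @ K^T plus weights @ V
    ffn = 4 * n * d_model * d_ff      # two linear layers
    return attn / 1e9, ffn / 1e9


for n in (64, 512, 4096):
    mem_attn, mem_ffn = memory_mb(n, d_model=512, h=8)
    fl_attn, fl_ffn = gflops(n, d_model=512)
    print(f"n={n:5d}  memory attn/ffn = {mem_attn:8.2f}/{mem_ffn:6.2f} MB  "
          f"compute attn/ffn = {fl_attn:7.3f}/{fl_ffn:7.3f} GFLOPs")
```

Under these assumptions the memory curves cross at n = d_ff / h and the compute curves cross at n = d_ff, so attention memory overtakes feedforward memory well before attention compute does.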

What Gets Stored for Backprop?

Step through the forward pass to see each activation saved to memory.

Memory accumulated (single layer): 0.00 MB of total activation memory for this layer

Stage colours: Active (processing) · Storing activations · Done
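
The step-through above amounts to a running sum over the tensors saved at each stage. Here is a minimal sketch of that accumulation; the stage names, their order, and the sizes are hypothetical choices for illustration (batch size 1, float32, MB as 2^20 bytes), not the exact stages the page animates.

```python
from itertools import accumulate


def forward_stages(n, d_model, h, d_ff, bytes_per_el=4):
    """Hypothetical (stage, bytes saved) list for one layer, batch size 1."""
    linear = n * d_model * bytes_per_el
    quadratic = h * n * n * bytes_per_el
    return [
        ("layer input", linear),
        ("Q, K, V projections", 3 * linear),
        ("attention scores", quadratic),
        ("softmax weights", quadratic),
        ("attention output", linear),
        ("feedforward hidden", n * d_ff * bytes_per_el),
    ]


stages = forward_stages(n=512, d_model=512, h=8, d_ff=2048)
running = list(accumulate(nbytes for _, nbytes in stages))
total = running[-1]
for (name, _), cumulative in zip(stages, running):
    print(f"{name:22s} {cumulative / 2**20:6.2f} MB "
          f"({100 * cumulative / total:5.1f}% of total)")
```

With these example settings the two n² stages account for the largest jumps in the running total.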

Key Insight: The n² Bottleneck

Attention score matrix

— MB

per layer · — MB across all layers

Double n → 4× memory for attention scores (quadratic: n² × 4 bytes per head in float32)
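
The quadratic growth is easy to check numerically. This sketch assumes one n×n float32 score matrix per head, batch size 1, and MB meaning 2^20 bytes; the function name and the example settings (n = 4096, h = 8, L = 24) are illustrative choices.

```python
def attention_score_mb(n, h, num_layers=1, bytes_per_el=4):
    """Memory for the attention score matrices in MB (batch size 1)."""
    return num_layers * h * n * n * bytes_per_el / 2**20


per_layer = attention_score_mb(4096, h=8)
all_layers = attention_score_mb(4096, h=8, num_layers=24)
ratio = attention_score_mb(8192, h=8) / per_layer

print(f"per layer:  {per_layer:.0f} MB")
print(f"all layers: {all_layers:.0f} MB")
print(f"doubling n multiplies score memory by {ratio:.0f}x")
```

The ratio does not depend on h or on the number of layers, because both cancel when the two sizes are divided.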

Why Flash Attention?

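Flash Attention never stores the full n×n score matrix in HBM. It computes attention in tiles that fit in on-chip SRAM and recomputes the scores during the backward pass, trading extra FLOPs for dramatically less HBM usage.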
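
The tiling idea rests on an online softmax, which can be illustrated in plain Python. This is only a sketch of the arithmetic for a single query row in the forward pass, not the fused GPU kernel. The function names, the tile size, and the random test data are all hypothetical.

```python
import math
import random


def naive_attention_row(q, keys, values):
    """Reference: materialises all n scores for one query before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / denom
            for j in range(dim)]


def tiled_attention_row(q, keys, values, tile=4):
    """Same result, but at most `tile` scores exist at any time (online softmax)."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * dim  # unnormalised weighted sum of values seen so far
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first tile).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]


rng = random.Random(0)
n, d = 10, 4
q = [rng.gauss(0, 1) for _ in range(d)]
keys = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

ref = naive_attention_row(q, keys, values)
out = tiled_attention_row(q, keys, values, tile=4)
print("max abs difference:", max(abs(a - b) for a, b in zip(ref, out)))
```

The tiled version keeps only a running maximum, a running denominator, and one accumulator per output dimension, so its working memory does not grow with n.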