Model Parameters
Activation Memory Breakdown (per layer, float32)
Each segment is a tensor stored during the forward pass. ⚠ red = n² bottleneck — these grow quadratically. When n > 1024 they flash to signal the bottleneck.
Scaling Curves
How memory and compute grow as n increases. The purple dashed line marks the current n. Watch attention (red) overtake feedforward (blue) as n grows.
Chart A — Memory vs Sequence Length (MB / layer)
Chart B — Compute vs Sequence Length (GFLOPs / layer)
What Gets Stored for Backprop?
Step through the forward pass to see each activation saved to memory.