Backpropagation needs the forward activation at every layer to compute gradients. For an L-layer network, naïvely storing all activations costs O(L) memory. Checkpointing saves only every K-th activation and recomputes the others on demand during the backward pass — trading roughly √L extra forward work for O(√L) memory. This is what lets practitioners fit a 70B-parameter transformer on a single GPU.
Records as you step. Forward fills memory, backward drains it; checkpoint strategies show the characteristic sawtooth from recompute.
Each dot = one strategy. Lower-left = better.
Naive backprop is O(L) memory. The chain rule needs the forward activation al to compute the gradient at layer l. Saving every layer's activation grows linearly with depth, and modern transformers can have hundreds of layers.
Checkpoint every K layers. Save activations only at positions 0, K, 2K, …, L. During the backward pass, when we need al for l not on a checkpoint, we re-execute the forward pass starting from the nearest earlier checkpoint.
The √L sweet spot. Setting K = √L gives O(√L) memory and ~33% extra forward work overall. For a 100-layer network: peak memory drops 10×, you do one full extra forward pass spread across the backward — usually a great trade on a memory-bound accelerator. (Chen et al. 2016, "Training Deep Nets with Sublinear Memory Cost")
The systems angle. A100/H100 GPUs are typically memory-bound on attention training. Activation checkpointing is the single biggest knob between "fits on one GPU" and "needs sharded optimizer states across a pod" — pair it with mixed precision and you can train models several times larger than the naive memory budget allows.