
Activation Checkpointing

Backpropagation needs the forward activation at every layer to compute gradients. For an L-layer network, naïvely storing all activations costs O(L) memory. Checkpointing saves only every K-th activation and recomputes the others on demand during the backward pass. With K ≈ √L, this trades roughly one extra forward pass for O(√L) activation memory. It is one of the standard techniques for fitting deeper models, longer sequences, or larger batches into a fixed GPU memory budget.

[Interactive demo: set the number of layers L (default 12), the activation size per layer in MB (default 1, which sets the per-layer memory cost), and a checkpoint strategy, then press Play to walk through one training step. Live stats report the phase, current layer, live and peak memory (MB), and total forward layer-passes relative to the no-checkpointing baseline.]

[Figure: Network, forward and backward pass. Legend: checkpoint (kept), activation saved, re-computing, discarded.]

[Figure: Memory over time, showing live and peak memory. The forward pass fills memory and the backward pass drains it; checkpoint strategies show a characteristic sawtooth from recomputation.]

[Figure: Pareto plot of peak memory vs total forward work. Each dot is one strategy; lower-left is better.]

Why this matters

Naive backprop is O(L) memory. The chain rule needs the forward activation a_l to compute the gradient at layer l, so every activation must still be available when the backward pass reaches its layer. Saving every layer's activation makes memory grow linearly with depth, and modern transformers can have a hundred or more layers.

Checkpoint every K layers. Save activations only at positions 0, K, 2K, …, L. During the backward pass, when we need a_l for l not on a checkpoint, we re-execute the forward pass starting from the nearest earlier checkpoint.
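The schedule described above can be sketched as a small bookkeeping simulation. This is a minimal sketch rather than a real training loop: it tracks only which activation indices are in memory, assumes every activation has the same size, and assumes the input a_0 and the output a_L are both kept. The function name `simulate_step` and its return format are hypothetical, chosen for illustration.

```python
def simulate_step(num_layers, k):
    """Simulate one training step with a checkpoint every k layers.

    Returns (peak_live_activations, forward_layer_passes).
    Activations are indexed 0..num_layers, where a_0 is the input.
    Assumption: all activations have equal size, so we only count them.
    """
    # Checkpoint positions 0, K, 2K, ..., plus the final activation a_L.
    checkpoints = sorted(set(range(0, num_layers, k)) | {num_layers})
    kept = set(checkpoints)

    live = {0}            # indices of activations currently in memory
    peak = 1
    forward_passes = 0

    # Forward pass: compute a_1..a_L, dropping non-checkpoint activations.
    for layer in range(1, num_layers + 1):
        live.add(layer)
        forward_passes += 1
        peak = max(peak, len(live))
        if layer - 1 not in kept:
            live.discard(layer - 1)

    # Backward pass: walk segments between consecutive checkpoints, last first.
    for start, end in reversed(list(zip(checkpoints, checkpoints[1:]))):
        # Recompute the discarded activations inside this segment.
        for layer in range(start + 1, end):
            live.add(layer)
            forward_passes += 1
            peak = max(peak, len(live))
        # Backprop through layers end..start+1; a_layer is free afterwards.
        for layer in range(end, start, -1):
            live.discard(layer)

    return peak, forward_passes


print(simulate_step(100, 1))   # store everything: (101, 100)
print(simulate_step(100, 10))  # checkpoint every 10 layers: (20, 190)
```

Under these assumptions, checkpointing every 10 layers of a 100-layer network keeps at most 20 activations live instead of 101, and runs 190 forward layer-passes instead of 100.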

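The √L sweet spot. Setting K = √L gives O(√L) memory at the cost of at most one extra forward pass, which is roughly 33% more total compute if the backward pass costs about twice the forward pass. For a 100-layer network with K = 10, about 10 checkpoints plus the 10 activations of the segment being recomputed are live at once, so peak activation memory drops roughly 5×, while the extra forward pass is spread across the backward pass. That is usually a great trade on a memory-bound accelerator. (Chen et al. 2016, "Training Deep Nets with Sublinear Memory Cost")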
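The choice K = √L can be motivated by a short calculation. This is a sketch under simplifying assumptions that are not stated in the text above: all activations have the same size, about L/K checkpoints are held at once, and at most K activations of the segment being recomputed are live on top of them. M(K) is a symbol introduced here for the peak number of live activations.

```latex
\begin{aligned}
M(K) &\approx \frac{L}{K} + K
  && \text{(stored checkpoints + one recomputed segment)} \\
\frac{dM}{dK} &= -\frac{L}{K^{2}} + 1 = 0
  \;\Longrightarrow\; K = \sqrt{L} \\
M\!\left(\sqrt{L}\right) &\approx 2\sqrt{L} = O\!\left(\sqrt{L}\right)
\end{aligned}
```

For L = 100 this gives K = 10 and about 20 live activations. Each non-checkpoint activation is recomputed once, so the extra work is at most L layer-passes, that is, one additional forward pass.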

The systems angle. Training large transformers on A100/H100-class GPUs is often limited by memory capacity before compute. Activation checkpointing is one of the biggest knobs for activation memory, and it can be the difference between a model that fits on one GPU and one that has to be sharded across several. It does not shrink weights, gradients, or optimizer states, so pair it with mixed precision to train models or batches noticeably larger than the naive memory budget allows.