CUDA Parallel Gradient Reduction

How a GPU sums gradients across a batch: registers → shared memory → global memory


Thread Initialization

Registers

Each thread computes its own partial gradient for one element of the batch. At this stage every gradient value lives in a thread-private register, so the gradients themselves have not yet caused any global memory writes.
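The initialization step above can be sketched roughly as follows. This is a minimal illustration, not the page's actual kernel: the loss (a one-parameter squared error, L_i = 0.5 * (w * x_i - y_i)^2) and the names `init_partial_gradients`, `compute_partials`, and `partial_out` are all assumptions made for the example. The per-thread value `grad` is a local scalar, which the compiler would normally keep in a register. The final store to `partial_out` exists only so the host can inspect the result; a real reduction would skip it and hand `grad` straight to the shared-memory stage.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-sample loss: L_i = 0.5 * (w * x[i] - y[i])^2,
// so dL_i/dw = (w * x[i] - y[i]) * x[i].
__global__ void init_partial_gradients(const float* x, const float* y, float w,
                                       float* partial_out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Thread-private local scalar; normally held in a register.
    float grad = 0.0f;
    if (i < n) {
        float residual = w * x[i] - y[i];  // inputs are loaded from global memory
        grad = residual * x[i];            // the gradient itself stays in a register
    }

    // Debug-only store so the host can read the value back (assumption for this
    // sketch). A full reduction would pass `grad` on to shared memory instead.
    if (i < n) {
        partial_out[i] = grad;
    }
}

// Host helper (hypothetical): launches one thread per batch element and
// returns the per-thread partial gradients. Error checking omitted for brevity.
std::vector<float> compute_partials(const std::vector<float>& x,
                                    const std::vector<float>& y, float w)
{
    int n = static_cast<int>(x.size());
    size_t bytes = n * sizeof(float);

    float *d_x = nullptr, *d_y = nullptr, *d_out = nullptr;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_x, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    init_partial_gradients<<<blocks, threads>>>(d_x, d_y, w, d_out, n);

    std::vector<float> out(n);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // also synchronizes

    cudaFree(d_x);
    cudaFree(d_y);
    cudaFree(d_out);
    return out;
}
```

Calling `compute_partials({1, 2, 3, 4}, {2, 4, 6, 8}, 1.0f)` launches a single block of 256 threads, of which only the first four pass the `i < n` guard; the rest keep `grad` at zero and write nothing.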