Thread Initialization
Registers
Each thread computes its own partial gradient for one element of the batch. At this stage every gradient value lives in a thread-private register, so the gradients themselves generate no global memory traffic.
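The register stage, and the two later stages named in the title, could be sketched in CUDA roughly as follows. This is a minimal sketch under stated assumptions, not the page's actual kernel: the scalar-weight least-squares model, the names `sumGradients`, `batchGradientSum`, and `kBlockSize`, and the choice of a shared-memory tree reduction followed by one `atomicAdd` per block are all hypothetical choices made for illustration. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Assumed block size; must be a power of two for the tree reduction below.
constexpr int kBlockSize = 256;

// Hypothetical model: scalar weight w, per-element loss 0.5 * (w * x[i] - y[i])^2,
// so the per-element gradient with respect to w is (w * x[i] - y[i]) * x[i].
__global__ void sumGradients(const float* x, const float* y, float w, int n,
                             float* gradSum) {
    __shared__ float partial[kBlockSize];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Stage 1 (registers): g is a thread-local scalar, which the compiler
    // normally keeps in a register. The inputs x[i] and y[i] are read from
    // global memory, but the gradient itself is not written back to it here.
    float g = 0.0f;
    if (i < n) {
        float err = w * x[i] - y[i];
        g = err * x[i];
    }

    // Stage 2 (shared memory): tree reduction of the block's partial gradients.
    partial[threadIdx.x] = g;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // Stage 3 (global memory): a single atomic add per block.
    if (threadIdx.x == 0) {
        atomicAdd(gradSum, partial[0]);
    }
}

// Host-side helper (hypothetical): copies the batch to the device, launches
// the kernel, and returns the summed gradient.
float batchGradientSum(const std::vector<float>& x, const std::vector<float>& y,
                       float w) {
    const int n = static_cast<int>(x.size());
    if (n == 0) return 0.0f;

    float *dX = nullptr, *dY = nullptr, *dSum = nullptr;
    cudaMalloc(&dX, n * sizeof(float));
    cudaMalloc(&dY, n * sizeof(float));
    cudaMalloc(&dSum, sizeof(float));
    cudaMemcpy(dX, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dY, y.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dSum, 0, sizeof(float));

    const int blocks = (n + kBlockSize - 1) / kBlockSize;
    sumGradients<<<blocks, kBlockSize>>>(dX, dY, w, n, dSum);

    float result = 0.0f;
    cudaMemcpy(&result, dSum, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(dX);
    cudaFree(dY);
    cudaFree(dSum);
    return result;
}
```

Calling `batchGradientSum(x, y, w)` with two equal-length vectors returns the batch gradient sum. Because floating-point addition is not associative and the order of the per-block `atomicAdd` calls is not fixed, the result can differ in the last few bits from a sequential CPU sum.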
How a GPU sums gradients across a batch — registers → shared memory → global memory