CUDA Parallel Gradient Reduction

How a GPU sums gradients across a batch — registers → shared memory → global memory

Memory tiers: Registers, Shared Memory, Global Memory

Threads / block: 32 (options: 32 = 1 warp, 64, 128)

Num blocks: 4 (options: 1, 2, 3, 4)

Thread Initialization (Registers)

Each thread computes its own partial gradient for one element of the batch. At this stage every gradient value lives in a thread-private register, so the reduction has not yet generated any global memory traffic.
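The registers → shared memory → global memory path can be sketched in CUDA roughly as follows. This is a minimal illustration rather than the code behind the visualization: the names `reduce_gradients_kernel` and `reduce_gradients` are hypothetical, the per-sample gradient is loaded from an input array as a stand-in for whatever each thread would actually compute, the block size is assumed to be a power of two (as with 32, 64, or 128), and CUDA error checking is omitted for brevity.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: one thread per batch element.
__global__ void reduce_gradients_kernel(const float* per_sample_grads,
                                        float* total, int n) {
    extern __shared__ float tile[];  // one slot per thread in the block

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    // Phase 1 (registers): the thread's partial gradient sits in a private
    // register. The load below stands in for the real gradient computation.
    float g = (idx < n) ? per_sample_grads[idx] : 0.0f;

    // Phase 2 (shared memory): publish the register value, then halve the
    // number of active threads each step until tile[0] holds the block sum.
    tile[tid] = g;
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) {
            tile[tid] += tile[tid + stride];
        }
        __syncthreads();  // barrier after every reduction step
    }

    // Phase 3 (global memory): one atomic add per block.
    if (tid == 0) {
        atomicAdd(total, tile[0]);
    }
}

// Hypothetical host helper: copies the gradients to the device, launches
// enough blocks to cover them, and returns the summed gradient.
float reduce_gradients(const std::vector<float>& grads, int threads_per_block) {
    // The tree reduction above assumes a power-of-two block size.
    assert(threads_per_block > 0 &&
           (threads_per_block & (threads_per_block - 1)) == 0);

    int n = static_cast<int>(grads.size());
    if (n == 0) return 0.0f;
    int num_blocks = (n + threads_per_block - 1) / threads_per_block;

    float* d_grads = nullptr;
    float* d_total = nullptr;
    cudaMalloc(&d_grads, n * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_grads, grads.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(float));

    size_t shared_bytes = threads_per_block * sizeof(float);
    reduce_gradients_kernel<<<num_blocks, threads_per_block, shared_bytes>>>(
        d_grads, d_total, n);

    float total = 0.0f;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_grads);
    cudaFree(d_total);
    return total;
}
```

Calling `reduce_gradients(grads, 32)` on 128 gradients launches 4 blocks of 32 threads. In this sketch each block passes 6 barriers (one after the shared-memory write and one per halving step) and issues a single atomic write to global memory, so the shared-memory phase absorbs almost all of the summation work.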