CUDA Branch Divergence Hardware Masking

Analyzing Total System Efficiency & Serialization Penalties

[Interactive visualization. Panels: Hardware SIMT Mask (Execution State), Thread Data Mapping (Processed IDs), Instruction Stream (PC). Legend: Unified (Compute/Finish), Path A (Do_this), Path B (Do_that). Readouts: Time (Cycles), Instant Active, Total Cumulative Eff., Total Latency (ms).]

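In the Naive case, threads in the same warp take different branches, so the hardware runs Path A and Path B one after the other with the inactive lanes masked off, wasting clock cycles. In the Optimal case, aligning data to warp boundaries makes every thread in a warp take the same branch, so no warp diverges and different warps can work on Path A and Path B at the same time.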
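The contrast above can be made concrete with a small host-side model. This is only a sketch under stated assumptions, not the visualization's own code: the predicates `takesPathA_naive` and `takesPathA_aligned`, the `DivergenceStats` struct, and `simulate` are hypothetical names, both paths are assumed to cost the same, and a warp is assumed to be 32 threads wide. It counts how many serialized passes each warp needs and what fraction of issued lane slots do useful work.

```cpp
constexpr int kWarpSize = 32;  // assumed warp width

// Hypothetical branch predicates, for illustration only.
// Naive: neighbouring threads alternate paths, so every warp holds both.
inline bool takesPathA_naive(int tid) { return tid % 2 == 0; }
// Aligned: the choice depends only on the warp index, so each warp is uniform.
inline bool takesPathA_aligned(int tid) { return (tid / kWarpSize) % 2 == 0; }

struct DivergenceStats {
    int passes;          // serialized branch passes, summed over all warps
    int divergentWarps;  // warps that had to execute both paths
    double efficiency;   // active lane slots / issued lane slots
};

// Models SIMT masking: a warp executes a path once if any of its lanes
// takes it, and lanes on the other path sit idle during that pass.
DivergenceStats simulate(int numThreads, bool (*takesPathA)(int)) {
    DivergenceStats s{0, 0, 0.0};
    int activeSlots = 0;
    int issuedSlots = 0;
    for (int base = 0; base < numThreads; base += kWarpSize) {
        int onA = 0;
        int onB = 0;
        for (int tid = base; tid < base + kWarpSize && tid < numThreads; ++tid) {
            if (takesPathA(tid)) ++onA; else ++onB;
        }
        int warpPasses = (onA > 0) + (onB > 0);
        s.passes += warpPasses;
        if (warpPasses == 2) ++s.divergentWarps;
        activeSlots += onA + onB;                // each thread is active in one pass
        issuedSlots += warpPasses * kWarpSize;   // every pass occupies the full warp
    }
    s.efficiency = issuedSlots ? static_cast<double>(activeSlots) / issuedSlots : 0.0;
    return s;
}
```

For 64 threads (two warps), `simulate(64, takesPathA_naive)` gives 4 passes, 2 divergent warps, and an efficiency of 0.5, while `simulate(64, takesPathA_aligned)` gives 2 passes, no divergent warps, and an efficiency of 1.0. In a real kernel the equivalent change is to branch on something like `threadIdx.x / warpSize` rather than `threadIdx.x % 2`, or to reorder the input so that each group of 32 consecutive elements takes the same path.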