CUDA Branch Divergence Hardware Masking

Analyzing Total System Efficiency & Serialization Penalties

[Interactive visualization. Panels: Hardware SIMT Mask (Execution State), Thread Data Mapping (Processed IDs), Instruction Stream (PC). Legend: Unified (Compute/Finish), Path A (Do_this), Path B (Do_that). Counters: Time (Cycles), Instant Active, Total Cumulative Efficiency, Total Latency (ms).]
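In the naive case, threads within the same warp take different branches, so the hardware executes Path A and Path B one after the other, masking off the threads that are not on the current path and wasting clock cycles. In the optimal case, aligning data to warp boundaries means every thread in a warp takes the same branch, so each warp executes only one path and different warps can process the two branches in parallel with no lanes masked off.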
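The difference between the two cases can be sketched with a small host-side model of the SIMT mask. This is not GPU code and not the visualization's own implementation. It is a minimal C++ sketch that assumes a warp of 32 threads, a single two-way branch, and equal cost for both paths. The names `branch_passes`, `branch_efficiency`, `interleaved`, and `warp_aligned` are hypothetical and exist only for illustration.

```cpp
#include <cstdint>

constexpr int kWarpSize = 32;  // warp width assumed by this model

// Naive mapping (hypothetical): even thread IDs take Path A, odd IDs take
// Path B, so every warp contains threads on both paths.
bool interleaved(int tid) { return tid % 2 == 0; }

// Warp-aligned mapping (hypothetical): all 32 threads of a warp share one
// path; even-numbered warps take Path A, odd-numbered warps take Path B.
bool warp_aligned(int tid) { return (tid / kWarpSize) % 2 == 0; }

// Number of serialized passes one warp needs to get through the branch.
// pred(tid) == true means the thread takes Path A, false means Path B.
template <typename Pred>
int branch_passes(int warp_id, Pred pred) {
    std::uint32_t mask_a = 0;  // bit i set: lane i is active on Path A
    for (int lane = 0; lane < kWarpSize; ++lane) {
        int tid = warp_id * kWarpSize + lane;
        if (pred(tid)) mask_a |= (1u << lane);
    }
    std::uint32_t mask_b = ~mask_a;  // remaining lanes are active on Path B
    // A path costs a pass only if at least one lane is active on it.
    return (mask_a != 0) + (mask_b != 0);
}

// Fraction of issued lane slots that do useful work inside the branch.
// Each thread is active in exactly one pass, so useful slots equal
// num_warps * 32 and issued slots equal total_passes * 32.
template <typename Pred>
double branch_efficiency(int num_warps, Pred pred) {
    int total_passes = 0;
    for (int w = 0; w < num_warps; ++w) total_passes += branch_passes(w, pred);
    return static_cast<double>(num_warps) / total_passes;
}
```

Under these assumptions, `branch_efficiency(4, interleaved)` models the naive case: every warp needs two passes and half of the lanes are masked off in each. `branch_efficiency(4, warp_aligned)` models the optimal case: every warp needs one pass. Real kernels will differ, because the two paths rarely cost the same and the compiler may replace short branches with predicated instructions.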