Warp-Level Scheduling (FGMT)

Hiding memory latency by switching between warps in a single pipeline.

Scalar CPU Code
// Explicit loop: one core iterates over all 40 elements.
for (int i = 0; i < 40; i++) {
  C[i] = A[i] + B[i];
}
GPU SPMD Code
// One "thread": there is no loop. The index i is implicit,
// supplied per thread by the hardware (e.g. the thread's index).
void kernel() {
  C[i] = A[i] + B[i];
}
Warp Instruction Stream (PC)
PC X:   load  r1, A[i]    (LSU)
PC X+1: load  r2, B[i]    (LSU)
PC X+2: add   r3, r1, r2  (ALU)
PC X+3: store C[i], r3    (LSU)

1 The Warp Pool

Latency Hiding (Memory Stalls)

Current scenario: 40 threads / 8 lanes per warp = 5 warps.
Each cycle, the scheduler selects a Ready warp for the Active slot, skipping warps that are Stalled on memory.

2 SIMD Hardware

Active warp: Warp 0
Mode: 8 lanes per warp
Warp PC: PC X
Issued op: LOAD
All 8 lanes share one physical data path.

Execution Cycle

Hardware selects a warp and issues its current instruction to the lanes.

Architectural Accuracy

While real GPUs manage thousands of warps, we are simulating a smaller pool of 5 to clearly see how the scheduler hides latency by skipping Stalled warps.