Warp-Level Scheduling (Fine-Grained Multithreading, FGMT)

Hiding memory latency by switching between warps in a single pipeline.

Scalar CPU Code

// Explicit Loop

for (int i = 0; i < 40; i++) {
    C[i] = A[i] + B[i];
}

GPU SPMD Code

// One "Thread"

void kernel() {
    // i is this thread's ID (0..39), supplied by the hardware at launch
    C[i] = A[i] + B[i];
}
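To make the contrast with the explicit loop concrete, here is a minimal C sketch that emulates the SPMD model on a CPU. The names `spmd_kernel` and `launch` are hypothetical, and passing the thread ID and arrays as parameters is an assumption made so the sketch compiles as ordinary C. A GPU would run these kernel instances in parallel groups (warps); this launcher simply runs them one after another.

```c
#include <assert.h>

/* One "thread": the body has no loop and touches only element i. */
void spmd_kernel(int i, const float *A, const float *B, float *C) {
    C[i] = A[i] + B[i];
}

/* Hypothetical launcher: runs the same kernel once per thread ID.
   The loop that the programmer wrote in the scalar version now lives
   here, outside the kernel. */
void launch(int num_threads, const float *A, const float *B, float *C) {
    assert(num_threads >= 0);
    for (int tid = 0; tid < num_threads; tid++) {
        spmd_kernel(tid, A, B, C);
    }
}
```

Calling `launch(40, A, B, C)` fills the same 40 elements as the scalar loop; the difference is that the kernel itself describes only one element's worth of work.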

Warp Instruction Stream (PC)

PC X:    load  r1, A[i]      (LSU)
PC X+1:  load  r2, B[i]      (LSU)
PC X+2:  add   r3, r1, r2    (ALU)
PC X+3:  store C[i], r3      (LSU)

1. The Warp Pool

Latency Hiding (Memory Stalls)

Current Scenario: 40 threads / 8 lanes per warp = 5 warps.
Each cycle, the scheduler selects one Ready warp for the Active slot and skips Stalled warps.
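The grouping of threads into warps is plain integer arithmetic. The sketch below shows one way to compute it; the function names `warp_count`, `warp_of`, and `lane_of` are hypothetical, and the contiguous thread-to-warp assignment is an assumption that matches this example.

```c
#include <assert.h>

/* Number of warps needed to cover num_threads, rounding up so a
   partially filled last warp is still counted. */
int warp_count(int num_threads, int lanes_per_warp) {
    assert(num_threads >= 0 && lanes_per_warp > 0);
    return (num_threads + lanes_per_warp - 1) / lanes_per_warp;
}

/* Which warp a thread belongs to (threads are grouped contiguously). */
int warp_of(int tid, int lanes_per_warp) {
    assert(tid >= 0 && lanes_per_warp > 0);
    return tid / lanes_per_warp;
}

/* Which lane of that warp executes the thread. */
int lane_of(int tid, int lanes_per_warp) {
    assert(tid >= 0 && lanes_per_warp > 0);
    return tid % lanes_per_warp;
}
```

With 40 threads and 8 lanes, `warp_count(40, 8)` is 5, and thread 39 lands in warp 4, lane 7.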

2. SIMD Hardware

Active: Warp 0

Mode: 8 lanes per warp

Warp PC: PC X

Issue Op: LOAD

Shared Physical Data Path

Execution Cycle

Each cycle, the hardware selects one Ready warp and issues the instruction at that warp's PC to all 8 lanes at once.
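A single issue step can be sketched in C as follows. This is an illustration under stated assumptions, not the page's actual implementation: the `Warp` struct and `issue` function are hypothetical, the PC is stored as an offset from PC X, and memory latency is ignored here so that only the "one instruction, eight lanes" behavior is visible.

```c
#include <assert.h>

#define LANES 8         /* lanes per warp, as in this example */
#define PROGRAM_LEN 4   /* load, load, add, store */

typedef struct {
    int id;    /* warp index; lane l handles element id * LANES + l */
    int pc;    /* offset from PC X, in the range 0..PROGRAM_LEN */
    float r1[LANES], r2[LANES], r3[LANES];  /* one register copy per lane */
} Warp;

/* Issue the instruction at the warp's PC to every lane, then advance
   the shared PC. Returns 1 if an instruction was issued, 0 if the warp
   had already finished the program. */
int issue(Warp *w, const float *A, const float *B, float *C) {
    assert(w->pc >= 0 && w->pc <= PROGRAM_LEN);
    if (w->pc == PROGRAM_LEN) {
        return 0;
    }
    for (int lane = 0; lane < LANES; lane++) {
        int i = w->id * LANES + lane;  /* this lane's element index */
        switch (w->pc) {
        case 0: w->r1[lane] = A[i]; break;                       /* load r1, A[i]  */
        case 1: w->r2[lane] = B[i]; break;                       /* load r2, B[i]  */
        case 2: w->r3[lane] = w->r1[lane] + w->r2[lane]; break;  /* add r3, r1, r2 */
        case 3: C[i] = w->r3[lane]; break;                       /* store C[i], r3 */
        }
    }
    w->pc++;  /* all lanes move to the next instruction together */
    return 1;
}
```

Note that the warp has a single PC but a separate register copy per lane: the lanes share control flow, not data.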

Architectural Accuracy

Real GPUs keep thousands of warps in flight, and their warps are wider (32 threads on NVIDIA hardware). This simulation uses a pool of 5 eight-lane warps so that you can clearly see how the scheduler hides latency by skipping Stalled warps.
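The effect of skipping Stalled warps can be estimated with a small cycle-count model. The sketch below rests on several assumptions that are not taken from the page: the function name `simulate` is hypothetical, the scheduler is round-robin, at most one instruction issues per cycle, and a warp is Stalled for `mem_latency` cycles after each load (stores and adds do not stall).

```c
#include <assert.h>

#define MAX_WARPS 64
#define PROGRAM_LEN 4   /* load, load, add, store */

/* Returns the number of cycles until every warp has issued all
   PROGRAM_LEN instructions, under the assumed model described above. */
int simulate(int num_warps, int mem_latency) {
    int pc[MAX_WARPS] = {0};        /* per-warp offset from PC X */
    int ready_at[MAX_WARPS] = {0};  /* first cycle the warp is Ready again */
    int remaining = num_warps;      /* warps that have not finished */
    int next = 0;                   /* round-robin starting point */
    int cycle = 0;

    assert(num_warps > 0 && num_warps <= MAX_WARPS && mem_latency >= 0);

    while (remaining > 0) {
        /* Scan for the first Ready warp, skipping Stalled and finished ones. */
        for (int k = 0; k < num_warps; k++) {
            int w = (next + k) % num_warps;
            if (pc[w] < PROGRAM_LEN && ready_at[w] <= cycle) {
                int is_load = (pc[w] == 0 || pc[w] == 1);
                pc[w]++;
                ready_at[w] = cycle + 1 + (is_load ? mem_latency : 0);
                if (pc[w] == PROGRAM_LEN) {
                    remaining--;
                }
                next = (w + 1) % num_warps;
                break;  /* only one issue per cycle */
            }
        }
        cycle++;  /* if no warp was Ready, this cycle is an idle bubble */
    }
    return cycle;
}
```

Under this model, one warp with a 4-cycle memory latency takes 12 cycles for its 4 instructions, while 5 warps finish all 20 instructions in 20 cycles: every stall is covered by another warp's work. With a 10-cycle latency, 5 warps take 32 cycles, so some bubbles remain, which is why real hardware keeps far more warps resident.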