Hiding memory latency by switching between warps in a single pipeline.
Current Scenario: 40 threads = 5 warps (8 threads per warp).
Each cycle, the scheduler picks one Ready warp for the Active slot.
The hardware then issues that warp's current instruction to all of its lanes.
Real GPUs manage thousands of warps; this simulation uses a pool of just 5 so you can clearly see how the scheduler hides latency by skipping Stalled warps.
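
The scheduling behavior described above can be sketched in a few lines of Python. This is only an illustrative model, not the simulation's actual code: the round-robin selection policy, the 4-cycle memory latency, and the names `Warp`, `simulate`, and `stall_until` are all assumptions made for the example. Individual lanes are not modeled; each warp issues one instruction per cycle when it is selected.

```python
from dataclasses import dataclass


@dataclass
class Warp:
    wid: int
    pc: int = 0           # index of the warp's next instruction
    stall_until: int = 0  # first cycle at which the warp is Ready again


def simulate(program, num_warps=5, mem_latency=4):
    """Return the warp id issued on each cycle (None means the pipeline idled).

    Assumptions: round-robin selection among Ready warps, and every "load"
    stalls its warp for `mem_latency` cycles after the issue cycle.
    """
    warps = [Warp(w) for w in range(num_warps)]
    trace = []
    cycle = 0
    last = -1  # id of the most recently issued warp
    while any(w.pc < len(program) for w in warps):
        issued = None
        # Start from the warp after the last one issued; skip Stalled/finished warps.
        for k in range(1, num_warps + 1):
            w = warps[(last + k) % num_warps]
            if w.pc < len(program) and w.stall_until <= cycle:
                op = program[w.pc]
                w.pc += 1
                if op == "load":
                    w.stall_until = cycle + 1 + mem_latency
                issued = last = w.wid
                break
        trace.append(issued)
        cycle += 1
    return trace


program = ["load", "add"]  # hypothetical two-instruction program
print(simulate(program, num_warps=5))  # five warps cover each other's stalls
print(simulate(program, num_warps=1))  # a single warp leaves idle cycles
```

With five warps, each warp's load latency is covered by the other four issuing in turn, so the pipeline never idles. With one warp, the same program leaves the pipeline idle for the full latency of the load.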