FlashAttention

Block-wise Attention in SRAM

Step 0 / 23

Phase: init

Start: All matrices (Q, K, V) reside in High-Bandwidth Memory (HBM). Fast SRAM is empty.

HBM: Slow, large main memory.
SRAM: Fast, small on-chip memory.
Key Takeaway: By tiling the computation with nested loops, FlashAttention keeps the O(N²) intermediate score matrix entirely inside SRAM, never materializing it in HBM, which drastically reduces HBM reads/writes.

Why these loops specifically?

  • Outer Loop (K, V blocks): Grab a chunk of Keys and Values and hold it in the fast SRAM.
  • Inner Loop (Q blocks): While holding that K/V chunk, sweep every block of Queries past it, updating each query's running softmax statistics.

Intermediate scores are born, used, and discarded entirely within SRAM, avoiding the O(N²) HBM traffic that standard attention pays to write and re-read the full score matrix.
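The loop structure above can be sketched in NumPy. This is a simplified illustration, not the actual kernel: array slices stand in for SRAM-resident tiles, the block sizes `Bc`/`Br` are arbitrary illustrative choices, and masking/dropout are omitted. The key mechanism is the online softmax: running per-row max `m` and denominator `l` let each score block be consumed and discarded immediately.

```python
import numpy as np

def flash_attention(Q, K, V, Bc=2, Br=2):
    """Block-wise attention with an online softmax (sketch of the
    FlashAttention loop order: outer over K/V blocks, inner over Q blocks).
    Q, K, V: (N, d) arrays; Bc, Br: illustrative K/V and Q block sizes."""
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))        # unnormalized output accumulator
    l = np.zeros(N)             # running softmax denominator per query row
    m = np.full(N, -np.inf)     # running score max per query row

    # Outer loop: hold one K/V block ("in SRAM") at a time.
    for j in range(0, N, Bc):
        Kj, Vj = K[j:j+Bc], V[j:j+Bc]
        # Inner loop: sweep every Q block past the resident K/V block.
        for i in range(0, N, Br):
            Qi = Q[i:i+Br]
            S = (Qi @ Kj.T) * scale          # (Br, Bc) scores: live only here
            m_new = np.maximum(m[i:i+Br], S.max(axis=1))
            P = np.exp(S - m_new[:, None])   # block-local softmax numerator
            corr = np.exp(m[i:i+Br] - m_new) # rescale old stats to new max
            l[i:i+Br] = corr * l[i:i+Br] + P.sum(axis=1)
            O[i:i+Br] = corr[:, None] * O[i:i+Br] + P @ Vj
            m[i:i+Br] = m_new
    return O / l[:, None]        # normalize once at the end
```

Note that `S` and `P` exist only inside the inner loop body, mirroring how the real kernel never writes the N×N score matrix back to HBM; only the small per-row statistics and the output tile persist across iterations.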