FlashAttention: exact attention by changing IO order

S is the score (logit) matrix, S = QKᵀ/√d: the dot product of every query token with every key token, before the softmax.
P is the attention-probability matrix: the row-wise softmax of S, used to weight V.
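
For concreteness, a minimal NumPy sketch of these two definitions (the function name and the 1/√d scaling are my additions, following standard scaled dot-product attention):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention that materializes S and P as full N×N matrices."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # scores: (N, N)
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P = P / P.sum(axis=-1, keepdims=True)          # row-wise probabilities: (N, N)
    return P @ V                                   # output: (N, d)
```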
[Diagram: Q, K, V, the scores S and weights P, and the SRAM/HBM memory hierarchy.]

[Interactive widget: a live memory model (fp16, one head) with adjustable parameters, reporting bytes for Q/K/V/O, the full N×N attention matrix, one B×B tile of scores, and the approximate memory saved.]
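
The same accounting in a few lines of Python (a hypothetical helper, not from the paper; fp16 means 2 bytes per element):

```python
def memory_model(N, d, B, bytes_per_elem=2):
    """Back-of-envelope fp16 memory model for one attention head."""
    qkvo  = 4 * N * d * bytes_per_elem   # Q, K, V, O: four N×d matrices
    full  = 2 * N * N * bytes_per_elem   # S and P: two full N×N matrices
    tile  = B * B * bytes_per_elem       # one B×B tile of scores held in SRAM
    saved = full - tile                  # HBM traffic avoided, roughly
    return dict(qkvo=qkvo, full_NxN=full, tile=tile, approx_saved=saved)

# e.g. N=4096, d=64, B=128:
# Q/K/V/O ≈ 2.1 MB, full S+P ≈ 67 MB, one tile ≈ 33 KB
print(memory_model(N=4096, d=64, B=128))
```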

FlashAttention is not mainly about “fewer FLOPs.” The win is fewer HBM reads/writes, achieved by never writing S or P to HBM: scores exist only one tile at a time in SRAM.
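
To make “never materializing S or P” concrete, here is a sketch of the tiled online-softmax recurrence in NumPy. The tile size B, the function name, and looping over K/V tiles only (the real kernel tiles Q as well and runs as one fused GPU pass) are simplifying assumptions:

```python
import numpy as np

def flash_attention_sketch(Q, K, V, B=128):
    """Exact attention over K/V tiles of B rows; no N×N matrix ever exists."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running max of each score row
    l = np.zeros(N)           # running softmax denominator per row
    for j in range(0, N, B):
        Kj, Vj = K[j:j+B], V[j:j+B]
        S_tile = Q @ Kj.T / np.sqrt(d)            # only an N×B tile of scores
        m_new = np.maximum(m, S_tile.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale earlier partial sums
        p = np.exp(S_tile - m_new[:, None])       # unnormalized weights for tile
        l = l * scale + p.sum(axis=1)
        O = O * scale[:, None] + p @ Vj
        m = m_new
    return O / l[:, None]                         # normalize once at the end
```

Up to floating-point error, the result matches the naive version above (np.allclose(flash_attention_sketch(Q, K, V), naive_attention(Q, K, V))), which is the sense in which FlashAttention is exact.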