FlashAttention: exact attention by changing IO order

S is the score (logit) matrix, S = QKᵀ/√d: the dot product of every query token with every key token, before the softmax.
P is the attention-probability matrix: the row-wise softmax of S, used to weight V.
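
For concreteness, a minimal NumPy sketch of these two definitions (the function name and the 1/√d scaling are my additions, following standard scaled dot-product attention):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Standard attention that materializes S and P as full N×N matrices."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)                       # scores: (N, N)
    P = np.exp(S - S.max(axis=-1, keepdims=True))  # numerically stable softmax
    P = P / P.sum(axis=-1, keepdims=True)          # row-wise probabilities: (N, N)
    return P @ V                                   # output: (N, d)
```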
[Diagram: Q, K, V, the scores S and weights P, and the SRAM/HBM memory hierarchy.]

[Interactive widget: a live memory model (fp16, one head) with adjustable parameters, reporting bytes for Q/K/V/O, the full N×N attention matrix, one B×B tile of scores, and the approximate memory saved.]
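
The same accounting in a few lines of Python (a hypothetical helper, not from the paper; fp16 means 2 bytes per element):

```python
def memory_model(N, d, B, bytes_per_elem=2):
    """Back-of-envelope fp16 memory model for one attention head."""
    qkvo  = 4 * N * d * bytes_per_elem   # Q, K, V, O: four N×d matrices
    full  = 2 * N * N * bytes_per_elem   # S and P: two full N×N matrices
    tile  = B * B * bytes_per_elem       # one B×B tile of scores held in SRAM
    saved = full - tile                  # HBM traffic avoided, roughly
    return dict(qkvo=qkvo, full_NxN=full, tile=tile, approx_saved=saved)

# e.g. N=4096, d=64, B=128:
# Q/K/V/O ≈ 2.1 MB, full S+P ≈ 67 MB, one tile ≈ 33 KB
print(memory_model(N=4096, d=64, B=128))
```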

FlashAttention is not mainly about “fewer FLOPs.” The win is fewer HBM reads/writes, achieved by never writing S or P to HBM: scores exist only one tile at a time in SRAM.
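
To make “never materializing S or P” concrete, here is a sketch of the tiled online-softmax recurrence in NumPy. The tile size B, the function name, and looping over K/V tiles only (the real kernel tiles Q as well and runs as one fused GPU pass) are simplifying assumptions:

```python
import numpy as np

def flash_attention_sketch(Q, K, V, B=128):
    """Exact attention over K/V tiles of B rows; no N×N matrix ever exists."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)   # running max of each score row
    l = np.zeros(N)           # running softmax denominator per row
    for j in range(0, N, B):
        Kj, Vj = K[j:j+B], V[j:j+B]
        S_tile = Q @ Kj.T / np.sqrt(d)            # only an N×B tile of scores
        m_new = np.maximum(m, S_tile.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale earlier partial sums
        p = np.exp(S_tile - m_new[:, None])       # unnormalized weights for tile
        l = l * scale + p.sum(axis=1)
        O = O * scale[:, None] + p @ Vj
        m = m_new
    return O / l[:, None]                         # normalize once at the end
```

Up to floating-point error, the result matches the naive version above (np.allclose(flash_attention_sketch(Q, K, V), naive_attention(Q, K, V))), which is the sense in which FlashAttention is exact.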