S is the score (logit) matrix: the dot product of every query token with every key token (S = QKᵀ/√d), before softmax.
P is the attention-probability matrix: the row-wise softmax of S, used to weight V.
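To make the two matrices concrete, here is a minimal NumPy sketch of the naive path. The function name `naive_attention` and the shapes are illustrative, not from any particular library:

```python
import numpy as np

def naive_attention(Q, K, V):
    """Reference attention: materializes the full N x N matrices S and P."""
    d = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d)             # scores: every query dot every key
    S -= S.max(axis=-1, keepdims=True)   # stabilize the softmax
    P = np.exp(S)
    P /= P.sum(axis=-1, keepdims=True)   # row-wise softmax -> attention weights
    return P @ V                         # weight the values

# S and P are each N x N. At N = 8192 in fp16, that is
# 8192^2 * 2 bytes = 128 MiB per head, per matrix.
```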
[Figure: live memory model for one attention head in fp16 — Q, K, V, the S (scores) and P (weights) matrices, and where each lives in SRAM vs. HBM.]
FlashAttention is not mainly about “fewer FLOPs.” It is about fewer HBM reads/writes, achieved by never materializing S or P in HBM.
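A rough sketch of the idea, assuming the standard online-softmax trick: walk over K and V in blocks, keep running row statistics, and only ever hold a tile of scores. The function name `tiled_attention` and the tile size `block` are illustrative choices, not the real kernel's parameters:

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """FlashAttention-style sketch: process K/V in tiles with an online
    softmax, so only a small tile of S ever exists at once and the full
    S and P matrices are never formed."""
    N, d = Q.shape
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)    # running row max of the logits
    l = np.zeros(N)            # running row sum of exp(logit - m)
    for j in range(0, N, block):
        Kj, Vj = K[j:j+block], V[j:j+block]
        Sj = Q @ Kj.T / np.sqrt(d)             # a tile of scores, never full S
        m_new = np.maximum(m, Sj.max(axis=1))
        scale = np.exp(m - m_new)              # rescale earlier partial sums
        Pj = np.exp(Sj - m_new[:, None])       # a tile of unnormalized weights
        l = l * scale + Pj.sum(axis=1)
        O = O * scale[:, None] + Pj @ Vj
        m = m_new
    return O / l[:, None]                      # normalize once at the end
```

This sketch returns the same output as the naive version above (up to floating-point error), but its peak score storage is N × block instead of N × N. The real kernel additionally tiles Q so that each tile of Q, K, and V, plus the running statistics, fits in SRAM, which is where the HBM traffic savings come from.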