PagedAttention from arXiv:2309.06180
32-token memory, with and without paging
A simple animation of why reserving a full 32-token region for every request wastes memory, and how paged KV blocks let three requests prefill and decode together.
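The arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration: the 32-token memory and 4-token block size come from the demo, while the per-request token counts and the helper names (`contiguous_capacity`, `blocks_needed`, `paged_usage`) are hypothetical.

```python
import math

MEMORY_TOKENS = 32  # total KV-cache capacity in the demo
BLOCK_SIZE = 4      # tokens per block in the demo


def contiguous_capacity(reserved_per_request=MEMORY_TOKENS):
    """How many requests fit if each reserves a full region up front."""
    return MEMORY_TOKENS // reserved_per_request


def blocks_needed(num_tokens):
    """Blocks a request occupies when memory is handed out block by block."""
    return math.ceil(num_tokens / BLOCK_SIZE)


def paged_usage(lengths):
    """Return (blocks used, token slots allocated but still empty)."""
    used_blocks = sum(blocks_needed(n) for n in lengths.values())
    allocated = used_blocks * BLOCK_SIZE
    return used_blocks, allocated - sum(lengths.values())


# Hypothetical current lengths (prompt plus generated tokens) per request.
lengths = {"A": 9, "B": 6, "C": 5}

resident_contiguous = contiguous_capacity()
used_blocks, unused_slots = paged_usage(lengths)

print("resident requests with full reservation:", resident_contiguous)
print("blocks used with paging:", used_blocks, "of", MEMORY_TOKENS // BLOCK_SIZE)
print("allocated-but-empty slots with paging:", unused_slots)
```

With a full 32-token reservation only one request is resident, and most of its region sits empty. With paging, the waste is bounded by the partially filled last block of each request.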
Memory: 32 tokens
Block size: 4 tokens
Requests: A, B, C
Active request: A
Resident requests: 1
Allocated tokens: 32
Step timeline
32-token memory
8 physical blocks × 4 tokens each (32 token slots in total).
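One way to picture how a request's tokens land in these blocks is a per-request block table that maps logical token positions to physical slots. The sketch below is a toy model, not the paper's or any library's implementation: the `BlockAllocator` class, its method names, and the first-free allocation policy are all assumptions made for illustration.

```python
BLOCK_SIZE = 4   # tokens per physical block, as in the demo
NUM_BLOCKS = 8   # physical blocks, as in the demo


class BlockAllocator:
    """Toy block-table allocator (hypothetical, for illustration only)."""

    def __init__(self):
        self.free = list(range(NUM_BLOCKS))  # free physical block ids
        self.tables = {}                     # request -> physical block ids
        self.lengths = {}                    # request -> tokens stored

    def append_token(self, req):
        """Store one more token for req and return its linear slot index."""
        table = self.tables.setdefault(req, [])
        n = self.lengths.get(req, 0)
        if n % BLOCK_SIZE == 0:              # no block yet, or last one is full
            if not self.free:
                raise MemoryError("no free blocks")
            table.append(self.free.pop(0))   # assumed policy: lowest free id
        self.lengths[req] = n + 1
        return self.slot(req, n)

    def slot(self, req, logical_idx):
        """Map a logical token position to a slot in the 32-slot view."""
        block = self.tables[req][logical_idx // BLOCK_SIZE]
        return block * BLOCK_SIZE + logical_idx % BLOCK_SIZE


alloc = BlockAllocator()
a_slots = [alloc.append_token("A") for _ in range(5)]  # A takes blocks 0 and 1
b_slots = [alloc.append_token("B") for _ in range(3)]  # B takes block 2
for _ in range(3):
    alloc.append_token("A")                            # A fills block 1
a_next = alloc.append_token("A")                       # A needs a third block

print(alloc.tables)  # A's blocks are not contiguous in physical memory
print(a_slots, b_slots, a_next)
```

In this run request A ends up in blocks 0, 1, and 3 because B claimed block 2 in between, which is exactly the non-contiguous layout that a single reserved region cannot express.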
Linear 32-slot view
What is happening in this step
Prefill and decode are shown one step at a time.
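A step-by-step timeline of this kind can be approximated as follows. The sketch assumes that each prompt is prefilled in a single step and that every decode step appends one token to each resident request; the prompt lengths, the `timeline` generator, and this scheduling policy are hypothetical and may differ from what the animation does.

```python
import math

BLOCK_SIZE = 4   # tokens per block, as in the demo
NUM_BLOCKS = 8   # physical blocks, as in the demo


def blocks_in_use(lengths):
    """Total blocks occupied by all resident requests."""
    return sum(math.ceil(n / BLOCK_SIZE) for n in lengths.values())


def timeline(prompts, decode_steps):
    """Yield (step, phase, request, blocks in use) one step at a time."""
    lengths = {}
    step = 0
    # Prefill: each prompt is written in a single step (assumption).
    for req, prompt_len in prompts.items():
        lengths[req] = prompt_len
        step += 1
        yield step, "prefill", req, blocks_in_use(lengths)
    # Decode: each step appends one token to every request (assumption).
    for _ in range(decode_steps):
        for req in lengths:
            lengths[req] += 1
        if blocks_in_use(lengths) > NUM_BLOCKS:
            raise MemoryError("out of blocks")
        step += 1
        yield step, "decode", "all", blocks_in_use(lengths)


# Hypothetical prompt lengths for the three requests.
events = list(timeline({"A": 5, "B": 3, "C": 4}, decode_steps=3))
for event in events:
    print(event)
```

Block usage grows only when a request crosses a 4-token boundary, so three requests stay resident together well within the 8 available blocks.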