PagedAttention from arXiv:2309.06180

32-token memory, with and without paging

A simple animation of why reserving a full 32-token region for every request wastes memory, and how paged KV blocks let three requests prefill and decode together.
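The comparison above can be sketched in a few lines of Python. This is only an illustrative model, not the allocator from the paper or from any library: the request lengths (9, 6, and 5 tokens) and the function names are assumptions chosen for the example, while the 32-token memory and 4-token block size come from this page.

```python
import math

MEMORY_TOKENS = 32   # total KV memory, in token slots
BLOCK_SIZE = 4       # tokens per paged block
MAX_SEQ_LEN = 32     # contiguous scheme reserves this much per request

# Hypothetical current sequence lengths for the three requests.
lengths = {"A": 9, "B": 6, "C": 5}

def contiguous_resident(lengths):
    """Admit requests while a full MAX_SEQ_LEN region still fits."""
    resident, used = [], 0
    for name in lengths:
        if used + MAX_SEQ_LEN <= MEMORY_TOKENS:
            resident.append(name)
            used += MAX_SEQ_LEN
    return resident, used

def paged_resident(lengths):
    """Admit requests while enough 4-token blocks remain for their tokens."""
    total_blocks = MEMORY_TOKENS // BLOCK_SIZE
    resident, used_blocks = [], 0
    for name, n in lengths.items():
        need = math.ceil(n / BLOCK_SIZE)
        if used_blocks + need <= total_blocks:
            resident.append(name)
            used_blocks += need
    return resident, used_blocks * BLOCK_SIZE

print(contiguous_resident(lengths))  # (['A'], 32)
print(paged_resident(lengths))       # (['A', 'B', 'C'], 28)
```

With these assumed lengths, the contiguous scheme holds one request and leaves 23 of its 32 reserved slots unused, while the paged scheme holds all three requests in 7 blocks and wastes only the 8 unused slots at the tails of their last blocks.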

Memory: 32 tokens
Block size: 4 tokens
Requests: A, B, C
Active request: A
Resident requests: 1
Allocated tokens: 32

32-token memory

8 physical blocks × 4 tokens each = 32 token slots.

Linear 32-slot view
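One way to read the linear view is as a flattening of (physical block, offset) pairs. The sketch below assumes a per-request block table that lists physical block numbers in logical order; the table contents and the helper name `physical_slot` are hypothetical and only serve the illustration.

```python
BLOCK_SIZE = 4   # tokens per block
NUM_BLOCKS = 8   # physical blocks in the 32-token memory

def physical_slot(block_table, token_index):
    """Map a request's logical token index to a slot in the linear 32-slot view."""
    logical_block, offset = divmod(token_index, BLOCK_SIZE)
    return block_table[logical_block] * BLOCK_SIZE + offset

# Hypothetical block table for request A: logical blocks 0, 1, 2
# live in physical blocks 5, 0, 3 (they need not be adjacent).
block_table_a = [5, 0, 3]

slots = [physical_slot(block_table_a, i) for i in range(9)]
print(slots)  # [20, 21, 22, 23, 0, 1, 2, 3, 12]
```

The request sees tokens 0 through 8 as one contiguous sequence even though they occupy three scattered regions of physical memory.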

What is happening in this step

Prefill and decode are shown one step at a time.
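A step-by-step run can be modelled with a small free-list allocator. This is a minimal sketch under stated assumptions: the class `PagedKVCache`, its methods, and the prompt lengths are invented for the example and do not reproduce the page's animation or any real library's API. Prefill claims enough blocks for the whole prompt; each decode step adds one token and claims a new block only when the last one is full.

```python
class PagedKVCache:
    """Toy block allocator: 8 blocks of 4 tokens, as on this page."""

    def __init__(self, num_blocks=8, block_size=4):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free physical block ids
        self.tables = {}                     # request -> list of physical blocks
        self.lengths = {}                    # request -> tokens stored

    def _grow(self, req, new_len):
        needed = -(-new_len // self.block_size)  # ceiling division
        table = self.tables.setdefault(req, [])
        extra = needed - len(table)
        if extra > len(self.free):
            raise MemoryError(f"no free block for request {req}")
        for _ in range(extra):
            table.append(self.free.pop(0))
        self.lengths[req] = new_len

    def prefill(self, req, prompt_len):
        """Store the KV entries for the whole prompt at once."""
        self._grow(req, prompt_len)

    def decode(self, req):
        """Append one generated token to an already prefilled request."""
        self._grow(req, self.lengths[req] + 1)


cache = PagedKVCache()
cache.prefill("A", 5)   # hypothetical prompt lengths
cache.prefill("B", 3)
cache.prefill("C", 4)
for req in ("A", "B", "C"):
    cache.decode(req)   # one decode step per request

print(cache.tables)  # {'A': [0, 1], 'B': [2], 'C': [3, 4]}
print(cache.free)    # [5, 6, 7]
```

In this run only request C needs a new block during decode, because its prompt exactly filled one block; A and B write their new token into space left in their last block.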