PagedAttention from arXiv:2309.06180

32-token memory, with and without paging

A simple animation of why reserving a full 32-token region for every request wastes memory, and how paged KV blocks let three requests prefill and decode together.
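The waste can be put in numbers with a minimal sketch. The 32-token memory and 4-token block size come from the demo; the per-request lengths below are hypothetical values chosen only for illustration, and the function names are not from any library.

```python
import math

MEMORY_TOKENS = 32   # total KV-cache capacity (from the demo)
BLOCK_TOKENS = 4     # tokens per physical block (from the demo)

def contiguous_capacity(max_len: int = MEMORY_TOKENS) -> int:
    """Requests that fit when each one reserves max_len slots up front."""
    return MEMORY_TOKENS // max_len

def paged_blocks_needed(lengths: dict[str, int]) -> int:
    """Blocks used when each request holds only ceil(len / block) blocks."""
    return sum(math.ceil(n / BLOCK_TOKENS) for n in lengths.values())

# Hypothetical current lengths for requests A, B, C (assumed, not from the source).
lengths = {"A": 10, "B": 6, "C": 3}

# Without paging: A reserves all 32 slots, so B and C cannot be resident.
wasted = MEMORY_TOKENS - lengths["A"]

# With paging: each request takes whole blocks only as it grows.
blocks = paged_blocks_needed(lengths)
free_blocks = MEMORY_TOKENS // BLOCK_TOKENS - blocks

print(contiguous_capacity(), wasted, blocks, free_blocks)
```

Under these assumed lengths, a full reservation leaves most of the memory idle behind a single request, while paging keeps all three requests resident and the only unused slots are the tail of each request's last block.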

Memory: 32 tokens

Block size: 4 tokens

Requests: A, B, C

32-token memory

8 physical blocks x 4 tokens each.
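How a request's logical token position lands in one of the 8 physical blocks can be sketched with a per-request block table. The table contents and the `physical_slot` helper are hypothetical, chosen for illustration; only the block count and block size come from the demo.

```python
BLOCK_TOKENS = 4   # tokens per physical block (from the demo)
NUM_BLOCKS = 8     # physical blocks in the 32-token memory (from the demo)

# Hypothetical block tables: logical block i of a request -> physical block id.
# The blocks of one request do not need to be adjacent in memory.
block_tables = {"A": [0, 3, 5], "B": [1, 4], "C": [2]}

def physical_slot(request: str, token_index: int) -> int:
    """Map a request's logical token position to a slot in the 32-slot view."""
    logical_block, offset = divmod(token_index, BLOCK_TOKENS)
    physical_block = block_tables[request][logical_block]
    return physical_block * BLOCK_TOKENS + offset

print(physical_slot("A", 5))  # logical block 1, offset 1 -> physical block 3
print(physical_slot("B", 4))  # logical block 1, offset 0 -> physical block 4
```

The indirection through the table is what lets a request keep growing without needing a contiguous region.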

Linear 32-slot view

What is happening in this step

Prefill and decode are shown one step at a time.
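A step-by-step walk of this kind can be approximated with a small free-list allocator. This is a sketch under stated assumptions, not the demo's or vLLM's actual code: the `PagedCache` class, its method names, and the prompt lengths are all hypothetical.

```python
from collections import deque

BLOCK_TOKENS = 4   # tokens per physical block (from the demo)
NUM_BLOCKS = 8     # physical blocks in the 32-token memory (from the demo)

class PagedCache:
    """Toy KV-cache allocator: a new block is taken only when one fills up."""

    def __init__(self) -> None:
        self.free = deque(range(NUM_BLOCKS))      # free physical block ids
        self.tables: dict[str, list[int]] = {}    # request -> physical blocks
        self.lengths: dict[str, int] = {}         # request -> tokens stored

    def _append_token(self, req: str) -> None:
        n = self.lengths.get(req, 0)
        if n % BLOCK_TOKENS == 0:                 # last block is full (or none yet)
            if not self.free:
                raise MemoryError("no free KV blocks")
            self.tables.setdefault(req, []).append(self.free.popleft())
        self.lengths[req] = n + 1

    def prefill(self, req: str, prompt_tokens: int) -> None:
        """Store the whole prompt, allocating blocks as needed."""
        for _ in range(prompt_tokens):
            self._append_token(req)

    def decode(self, req: str) -> None:
        """Store one newly generated token."""
        self._append_token(req)

cache = PagedCache()
cache.prefill("A", 5)   # hypothetical prompt lengths
cache.prefill("B", 4)
cache.prefill("C", 2)
cache.decode("B")       # B's block was full, so this step takes a new block
cache.decode("A")       # A still has room in its last block
print(cache.tables, len(cache.free), sum(cache.lengths.values()))
```

Only the decode step that crosses a block boundary changes the free list; every other step writes into a block the request already owns.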