Tokenizer & BPE Visualizer

1. Build a tokenization example

This demo uses a small, lowercase toy vocabulary. It is not a full production tokenizer, but it shows the core idea behind byte-pair encoding (BPE), the basis of many modern subword tokenizers: repeatedly merge the highest-priority adjacent pair of symbols.
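The merge loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind this page: the `TOY_RANKS` table, the `</w>` end-of-word marker, and the function name `bpe_encode` are all hypothetical choices made for the example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Hypothetical toy merge table (not the page's actual rules).
# Lower rank means the pair is merged earlier.
TOY_RANKS: Dict[Pair, int] = {
    ("i", "n"): 0,
    ("in", "g"): 1,
    ("t", "o"): 2,
    ("k", "e"): 3,
    ("to", "ke"): 4,
    ("toke", "n"): 5,
}


def bpe_encode(word: str, ranks: Dict[Pair, int]) -> List[str]:
    """Repeatedly merge the best-ranked adjacent pair until no rule applies."""
    symbols = list(word.lower()) + ["</w>"]  # assumed end-of-word marker
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        candidates = [p for p in pairs if p in ranks]
        if not candidates:
            break  # no known pair left: the remaining symbols are final
        best = min(candidates, key=lambda p: ranks[p])
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            # Merge every occurrence of the chosen pair in this pass.
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols
```

With this table, `bpe_encode("tokening", TOY_RANKS)` ends as `["token", "ing", "</w>"]`, while a word that matches no rule, such as `"xyz"`, stays split into single characters.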

Input Text

Focus Word

Animation Speed

2. Step through merges

Step 0 shows the focus word split into characters plus an end-of-word marker.
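As a rough sketch, Step 0 amounts to the following. The marker string `</w>` and the helper name `initial_symbols` are assumptions for illustration; the page does not say which marker it uses.

```python
from typing import List

END_OF_WORD = "</w>"  # assumed marker; the demo's actual symbol may differ


def initial_symbols(word: str) -> List[str]:
    """Split a word into lowercase characters and append the end-of-word marker."""
    return list(word.lower()) + [END_OF_WORD]
```

For example, `initial_symbols("Token")` gives `["t", "o", "k", "e", "n", "</w>"]`. The marker lets merge rules distinguish a fragment at the end of a word from the same fragment in the middle of one.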

Current merge: None yet

Current Symbol Stream

The highlighted pair is the next adjacent pair chosen by BPE.

Step 0

Chosen Pair

Merged Result
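The step, chosen pair, and merged result shown in this panel could be produced by a generator along these lines. It is a sketch only: `merge_trace`, the `TOY_RANKS` table, and the `</w>` marker are hypothetical, and this version merges one occurrence of the chosen pair per step (the leftmost), which may differ from how the page animates repeated pairs.

```python
from typing import Dict, Iterator, List, Tuple

Pair = Tuple[str, str]

# Hypothetical rules for illustration; lower rank is applied first.
TOY_RANKS: Dict[Pair, int] = {
    ("i", "n"): 0,
    ("in", "g"): 1,
    ("ing", "</w>"): 2,
}


def merge_trace(word: str, ranks: Dict[Pair, int]) -> Iterator[Tuple[int, Pair, List[str]]]:
    """Yield (step number, chosen pair, symbol stream after the merge)."""
    symbols = list(word.lower()) + ["</w>"]
    step = 0
    while True:
        # Collect (rank, position) for every adjacent pair that has a rule.
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            return
        _, i = min(candidates)  # lowest rank wins, leftmost on ties
        pair = (symbols[i], symbols[i + 1])
        symbols[i:i + 2] = [pair[0] + pair[1]]
        step += 1
        yield step, pair, list(symbols)


for step, pair, stream in merge_trace("sing", TOY_RANKS):
    print(step, pair, stream)
```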

Why This Matters

Character-level tokenization is flexible, but it creates long sequences. BPE compresses frequent patterns into larger subword units while keeping the ability to fall back to characters for unfamiliar words.

Merge Rules

Toy merge ranks. A lower rank means the pair is merged earlier.
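A common convention, assumed here, is that a rule's rank is simply its position in the ordered list of learned merges. The sketch below uses a hypothetical three-rule list; `build_ranks` and `higher_priority` are illustrative names, not part of this page.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Hypothetical merge list, in the order the rules are assumed to have been learned.
MERGES: List[Pair] = [("i", "n"), ("in", "g"), ("e", "r")]


def build_ranks(merges: List[Pair]) -> Dict[Pair, int]:
    """Map each pair to its position in the list; position 0 has top priority."""
    return {pair: rank for rank, pair in enumerate(merges)}


def higher_priority(a: Pair, b: Pair, ranks: Dict[Pair, int]) -> Pair:
    """Return whichever of two known pairs would be merged first."""
    return a if ranks[a] < ranks[b] else b


ranks = build_ranks(MERGES)
```

Here `ranks[("i", "n")]` is 0, so that pair wins over `("e", "r")`, whose rank is 2.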

0 rules

3. Compare tokenization strategies

The exact numbers here come from the toy vocabulary used in this page, but the tradeoff is the same in real tokenizers: BPE usually lands between pure characters and whole words.

Character-Level

0 tokens

Maximum flexibility, but the longest sequence length.

Word-Level

0 tokens

Shorter sequences, but brittle on unseen words and spelling variants.

Toy BPE

0 tokens

Reusable subwords reduce sequence length while still backing off to smaller pieces.
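The three counts can be compared with a short script. Everything here is an assumption made for the sketch: the `RANKS` table is hypothetical, whitespace is not counted as a character token, and the end-of-word marker is left out so the three counts are directly comparable. The page's own counting rules may differ.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]

# Hypothetical merge rules; lower rank is applied first.
RANKS: Dict[Pair, int] = {
    ("i", "n"): 0,
    ("in", "g"): 1,
    ("t", "o"): 2,
    ("k", "e"): 3,
    ("to", "ke"): 4,
    ("toke", "n"): 5,
}


def char_tokens(text: str) -> List[str]:
    """One token per non-whitespace character."""
    return [ch for ch in text.lower() if not ch.isspace()]


def word_tokens(text: str) -> List[str]:
    """One token per whitespace-separated word."""
    return text.lower().split()


def bpe_word(word: str, ranks: Dict[Pair, int]) -> List[str]:
    """Merge the best-ranked adjacent pair until no rule applies."""
    symbols = list(word)
    while True:
        ranked = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not ranked:
            return symbols
        _, i = min(ranked)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]


def bpe_tokens(text: str, ranks: Dict[Pair, int]) -> List[str]:
    tokens: List[str] = []
    for word in word_tokens(text):
        tokens.extend(bpe_word(word, ranks))
    return tokens


def compare(text: str) -> Dict[str, int]:
    return {
        "characters": len(char_tokens(text)),
        "words": len(word_tokens(text)),
        "toy_bpe": len(bpe_tokens(text, RANKS)),
    }
```

For the input `"tokening tokens sing"`, this sketch counts 18 character tokens, 3 word tokens, and 6 toy BPE tokens, so BPE lands between the other two.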

What students should notice

1. Text is not consumed raw

Transformers do not read strings directly. They read token IDs, so tokenization is the front door to embeddings, attention, and context length.
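The step from tokens to IDs is a plain lookup. The sketch below assumes a tiny hypothetical vocabulary and an `<unk>` fallback entry; a BPE tokenizer whose base symbols cover every character would not normally need that fallback.

```python
from typing import Dict, List

# Hypothetical vocabulary; a token's ID is its position in this list.
VOCAB: List[str] = ["<unk>", "token", "ing", "s", "</w>"]
TOKEN_TO_ID: Dict[str, int] = {token: i for i, token in enumerate(VOCAB)}


def to_ids(tokens: List[str]) -> List[int]:
    """Look up each token's integer ID, falling back to <unk> if it is missing."""
    unknown = TOKEN_TO_ID["<unk>"]
    return [TOKEN_TO_ID.get(token, unknown) for token in tokens]
```

With this vocabulary, `to_ids(["token", "ing", "</w>"])` returns `[1, 2, 4]`. These integers, not the original string, are what index into the embedding table.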

2. BPE is greedy and local

At each step, the tokenizer looks only at adjacent symbol pairs and applies the best-ranked (lowest-rank) pair found in its learned merge rules. It does not look ahead or revisit earlier merges.
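One consequence of a greedy, local choice is that when two candidate pairs overlap, the better-ranked one wins and the other can no longer apply. The sketch below shows this with two hypothetical rank tables over the stream `a b c`; `greedy_step` is an illustrative name.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]


def greedy_step(symbols: List[str], ranks: Dict[Pair, int]) -> Optional[List[str]]:
    """Apply the single best-ranked adjacent merge, or return None if no rule applies."""
    best_index: Optional[int] = None
    best_rank: Optional[int] = None
    for i in range(len(symbols) - 1):
        rank = ranks.get((symbols[i], symbols[i + 1]))
        if rank is not None and (best_rank is None or rank < best_rank):
            best_index, best_rank = i, rank
    if best_index is None:
        return None
    i = best_index
    return symbols[:i] + [symbols[i] + symbols[i + 1]] + symbols[i + 2:]


stream = ["a", "b", "c"]
prefers_bc = {("a", "b"): 1, ("b", "c"): 0}  # hypothetical ranks
prefers_ab = {("a", "b"): 0, ("b", "c"): 1}  # same rules, opposite priority

after_bc = greedy_step(stream, prefers_bc)
after_ab = greedy_step(stream, prefers_ab)
```

With `prefers_bc` the stream becomes `["a", "bc"]`, and the `("a", "b")` rule can never fire afterward; with `prefers_ab` it becomes `["ab", "c"]`. The same rules in a different order give a different segmentation.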

3. Token count affects cost

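Longer token sequences mean more attention work, which grows roughly quadratically with token count, and more KV-cache memory, which grows linearly. Better compression can lower inference cost, although vocabulary size also matters: a larger vocabulary enlarges the embedding and output layers.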
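A back-of-the-envelope calculation makes the scaling concrete. The sketch assumes a standard decoder-only transformer with full attention and an uncompressed KV cache; the model dimensions (12 layers, 12 key/value heads, head size 64, 2-byte values) are hypothetical numbers chosen only for the arithmetic.

```python
def attention_score_pairs(n_tokens: int) -> int:
    """Query-key pairs per head per layer under full attention."""
    return n_tokens * n_tokens


def kv_cache_bytes(
    n_tokens: int,
    n_layers: int = 12,      # hypothetical model depth
    n_kv_heads: int = 12,    # hypothetical number of key/value heads
    head_dim: int = 64,      # hypothetical per-head dimension
    bytes_per_value: int = 2,  # e.g. 16-bit floats
) -> int:
    """Memory for cached keys and values; the leading 2 covers K and V."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


def embedding_params(vocab_size: int, d_model: int) -> int:
    """Parameters in the input embedding table."""
    return vocab_size * d_model


# Example: the same text as 18 character tokens versus 6 subword tokens.
pair_ratio = attention_score_pairs(18) / attention_score_pairs(6)
cache_ratio = kv_cache_bytes(18) / kv_cache_bytes(6)
```

Cutting the token count by a factor of 3 reduces the query-key pairs by a factor of 9 and the cache size by a factor of 3. The `embedding_params` helper shows the other side of the tradeoff: that table grows in proportion to vocabulary size.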

Token count summary

Characters

Words

Toy BPE

BPE vs Characters

Toy vocabulary

The merge rules below are intentionally small and human-readable. They cover recurring English fragments such as ing, er, tion, and token.
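To show how a handful of merge rules yields fragments like these, the sketch below builds a vocabulary from an ordered merge list: single-character base symbols, an end-of-word marker, and one new token per rule. The `MERGES` list is hypothetical and is not the page's actual rule set; `build_vocabulary` and the `</w>` marker are likewise assumptions.

```python
from typing import List, Tuple

Pair = Tuple[str, str]

# Hypothetical rules that build up "ing", "er", "tion", and "token".
MERGES: List[Pair] = [
    ("i", "n"), ("in", "g"),
    ("e", "r"),
    ("t", "i"), ("ti", "o"), ("tio", "n"),
    ("t", "o"), ("k", "e"), ("to", "ke"), ("toke", "n"),
]


def build_vocabulary(merges: List[Pair], end_of_word: str = "</w>") -> List[str]:
    """Return base symbols, the marker, then one merged token per rule."""
    vocab: List[str] = []
    seen = set()

    def add(token: str) -> None:
        if token not in seen:
            seen.add(token)
            vocab.append(token)

    # Base symbols: the single-character parts that appear in any rule.
    for pair in merges:
        for part in pair:
            if len(part) == 1:
                add(part)
    add(end_of_word)
    # Each rule contributes the token formed by joining its two parts.
    for left, right in merges:
        add(left + right)
    return vocab


vocabulary = build_vocabulary(MERGES)
```

Ten rules over eight base characters give a 19-entry vocabulary here, which is why a small, readable rule list is enough to demonstrate the idea.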