NLP Foundations

Tokenizer & BPE Visualizer

Watch raw text turn into tokens. Step through a toy Byte Pair Encoding pipeline, see which adjacent pair gets merged next, and compare character, word, and BPE token counts side by side.

Focus Word: tokenizer
Merge Step: 0 / 0
BPE Tokens: 0
Length Savings: 0%

1. Build a tokenization example

This demo uses a small lowercase toy vocabulary. It is not a full production tokenizer, but it shows the core BPE idea used in modern subword tokenization: repeatedly merge the adjacent pair with the highest priority (the lowest merge rank) until no known pair remains.
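The merge loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind this page: the `</w>` end-of-word marker, the `RANKS` table, and the function names are all assumptions made up for the example.

```python
END = "</w>"  # assumed end-of-word marker; the page does not name its own

# Hypothetical merge table: pair -> rank (lower rank is merged first).
RANKS = {
    ("t", "o"): 0,
    ("e", "r"): 1,
    ("er", END): 2,
    ("to", "k"): 3,
    ("i", "z"): 4,
}

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def bpe_encode(word, ranks=RANKS):
    """Apply ranked merges to one word until no known pair is left."""
    symbols = list(word.lower()) + [END]
    while len(symbols) > 1:
        candidates = [p for p in set(zip(symbols, symbols[1:])) if p in ranks]
        if not candidates:
            break  # nothing left to merge; remaining symbols are the tokens
        best = min(candidates, key=ranks.get)  # lowest rank wins
        symbols = merge_pair(symbols, best)
    return symbols

print(bpe_encode("tokenizer"))  # ['tok', 'e', 'n', 'iz', 'er</w>']
print(bpe_encode("xyz"))        # ['x', 'y', 'z', '</w>']
```

With this hypothetical table, "tokenizer" shrinks from ten symbols to five, while "xyz" matches no rule and stays as individual characters plus the marker.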

2. Step through merges

Step 0 shows the focus word split into characters plus an end-of-word marker.
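As a rough sketch of what Step 0 produces, the snippet below splits a word into characters and appends an end-of-word marker. The marker string `</w>` and the helper name `initial_symbols` are assumptions for illustration; the page may use a different symbol.

```python
END_OF_WORD = "</w>"  # assumed marker string

def initial_symbols(word):
    """Step 0: one symbol per character, followed by the end-of-word marker."""
    return list(word) + [END_OF_WORD]

print(initial_symbols("tokenizer"))
# ['t', 'o', 'k', 'e', 'n', 'i', 'z', 'e', 'r', '</w>']
```

The marker lets later merges tell a word-final piece such as `er</w>` apart from the same letters in the middle of a word.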

Current merge: None yet
Current Symbol Stream
The highlighted pair is the adjacent pair that BPE will merge next.
Step 0
Chosen Pair
Merged Result
Why This Matters
Character-level tokenization is flexible, but it creates long sequences. BPE compresses frequent patterns into larger subword units while keeping the ability to fall back to characters for unfamiliar words.
Merge Rules
Toy merge ranks. A lower rank means the pair is merged earlier.
0 rules
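For context on where such a rank table can come from: in classic BPE training, ranks are learned by repeatedly counting adjacent pairs in a corpus and merging the most frequent one. The sketch below assumes that procedure; the ranks on this page may simply be hand-picked, and `learn_merge_ranks`, the `</w>` marker, and the tie-breaking rule are illustrative choices.

```python
from collections import Counter

END = "</w>"  # assumed end-of-word marker

def learn_merge_ranks(corpus_words, num_merges):
    """Learn up to `num_merges` merge rules; rank 0 is the first pair merged."""
    # Each distinct word is stored as a tuple of symbols with its frequency.
    vocab = Counter(tuple(w) + (END,) for w in corpus_words)
    ranks = {}
    for rank in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        ranks[best] = rank
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return ranks

print(learn_merge_ranks(["abab", "abab", "ab"], 10))
```

On this tiny corpus the pair `('a', 'b')` is the most frequent, so it receives rank 0; later ranks build on the symbols created by earlier merges.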

3. Compare tokenization strategies

The exact numbers here come from the toy vocabulary used on this page, but the tradeoff is the same in real tokenizers: BPE usually produces fewer tokens than character-level splitting and more than word-level splitting.

Character-Level
0 tokens

Maximum flexibility, but the longest sequence length.

Word-Level
0 tokens

Shorter sequences, but brittle on unseen words and spelling variants.

Toy BPE
0 tokens

Reusable subwords reduce sequence length while still backing off to smaller pieces.
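The three-way comparison can be approximated with the sketch below. Several things here are assumptions rather than this page's actual logic: the `RANKS` table is hypothetical, character counts skip whitespace, and "length savings" is taken to mean the reduction of the BPE count relative to the character count.

```python
END = "</w>"  # assumed end-of-word marker

# Hypothetical merge ranks covering only "the" and "cat".
RANKS = {
    ("t", "h"): 0, ("th", "e"): 1, ("the", END): 2,
    ("c", "a"): 3, ("ca", "t"): 4, ("cat", END): 5,
}

def bpe_word(word, ranks):
    """Merge the lowest-ranked known pair, one occurrence at a time."""
    symbols = list(word) + [END]
    while True:
        ranked = [(ranks[p], i)
                  for i, p in enumerate(zip(symbols, symbols[1:]))
                  if p in ranks]
        if not ranked:
            return symbols
        _, i = min(ranked)  # lowest rank, leftmost position on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]

def compare(text, ranks=RANKS):
    words = text.lower().split()
    char_count = sum(len(w) for w in words)  # whitespace is not counted
    word_count = len(words)
    bpe_count = sum(len(bpe_word(w, ranks)) for w in words)
    # Assumed definition of savings: reduction relative to character-level.
    savings = 1 - bpe_count / char_count if char_count else 0.0
    return {"char": char_count, "word": word_count,
            "bpe": bpe_count, "savings": savings}

result = compare("the cat sat")
print(result["char"], result["word"], result["bpe"])  # 9 3 6
print(f"{result['savings']:.0%}")                     # 33%
```

Here "the" and "cat" each collapse to a single token, while the unseen word "sat" backs off to three characters plus the marker, which is why the BPE total sits between the character and word counts.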