Deep dive into tokenization — how text becomes numbers, token types, context windows, and vocabulary.
Live Tokenizer
Each colored block is one token. Hover a token to see its ID. Spacing, punctuation, and capitalization all affect tokenization.
Live stats update as you type: characters, tokens, words, tokens per word, average token length, and unique tokens.
Token visualization (hover for ID)
Why tokens instead of characters or words?
Character-level models need very long sequences to represent even short texts. Word-level models need huge vocabularies and cannot handle words they have never seen. BPE finds a middle ground: common words get single tokens, while rare words are split into familiar subwords. "unforgettable" might become ["un", "forget", "table"].
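To make the split concrete, here is a toy subword splitter. Note the hedge: real BPE applies learned merge rules in order, while this simplified greedy longest-match scheme (closer to WordPiece) just illustrates how a rare word decomposes into familiar pieces. The vocabulary below is made up for the example.

```python
# Toy vocabulary for illustration only (a real one has tens of thousands of entries).
VOCAB = {"un", "forget", "table", "for", "get"}

def split_subwords(word: str) -> list[str]:
    """Greedy longest-match subword split against VOCAB."""
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fallback: emit a single character when nothing matches.
            pieces.append(word[i])
            i += 1
    return pieces

print(split_subwords("unforgettable"))  # → ['un', 'forget', 'table']
```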
Token Types
Modern tokenizers produce several distinct token types. Understanding them helps you predict how text will be chunked and why token counts vary.
BPE (Byte Pair Encoding) starts with individual characters, then iteratively merges the most frequent adjacent pairs. After thousands of iterations, common words become single tokens while rare/new words are handled as subword sequences.
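The merge loop described above can be sketched in a few lines of Python. This is a minimal training sketch on a tiny made-up corpus, not a production tokenizer (real implementations work on bytes, use pre-tokenization, and are heavily optimized):

```python
from collections import Counter

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = learn_bpe(corpus, 4)
print(merges)  # first merges fuse 'l'+'o', then 'lo'+'w', building up "low"
```

After enough merges, frequent whole words like "low" become single tokens, while "newest" stays a subword sequence.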
GPT-2 and GPT-3 use a vocabulary of roughly 50,000 tokens (50,257, to be exact); SentencePiece-based models vary from 32K to 256K entries.
Context Window Visualizer
Every LLM has a maximum context window — the number of tokens it can "see" at once. Drag the slider to fill the window and see what gets cut off.
Example state: with a 4,096-token window and 3,000 tokens used, 1,096 tokens remain and nothing is truncated yet. In the token grid, purple = used, dark = available, red = truncated.
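The truncation the visualizer shows can be sketched as a simple slice over token IDs. The left-truncation policy (drop the oldest tokens first, as chat interfaces typically do) is an assumption for this sketch; systems differ in what they cut.

```python
def fit_to_window(prompt_tokens: list[int], window: int = 4096,
                  reserve_for_output: int = 0) -> list[int]:
    """Keep the most recent tokens that fit the context budget.
    Assumed policy: left-truncation (oldest tokens dropped first)."""
    budget = window - reserve_for_output
    if len(prompt_tokens) <= budget:
        return prompt_tokens
    return prompt_tokens[-budget:]

tokens = list(range(5000))        # pretend prompt of 5,000 token IDs
kept = fit_to_window(tokens, window=4096)
print(len(kept), kept[0])         # 4096 904 — the first 904 tokens were cut
```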
BPE Vocabulary Sample
A sample of tokens from a typical BPE vocabulary. Notice: single characters, common words, common subwords, numbers, punctuation, and special tokens all coexist.
How token ID assignment works: Tokens are ranked by frequency during training. IDs 0–255 are usually reserved as byte fallbacks. Common English words get low IDs. Special tokens like [CLS], [SEP], and <|endoftext|> are assigned fixed IDs.
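The layout just described can be sketched as a vocabulary builder. The byte-token naming (`<0x41>`-style) and the exact placement of special tokens vary between tokenizers; this follows the layout the text describes and is illustrative, not any specific library's scheme.

```python
def build_vocab(ranked_subwords: list[str], special: list[str]) -> dict[str, int]:
    """Assign IDs: byte fallbacks 0-255, then frequency-ranked subwords,
    then special tokens at fixed trailing positions."""
    vocab = {f"<0x{b:02X}>": b for b in range(256)}       # byte fallbacks
    for i, tok in enumerate(ranked_subwords):             # frequency-rank order
        vocab[tok] = 256 + i
    for j, tok in enumerate(special):                     # fixed trailing IDs
        vocab[tok] = 256 + len(ranked_subwords) + j
    return vocab

vocab = build_vocab(["the", " the", "ing", " and"], ["<|endoftext|>"])
print(vocab["the"], vocab["<|endoftext|>"])  # 256 260
```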