Deep dive into tokenization — how text becomes numbers, token types, context windows, and vocabulary.
Live Tokenizer
Each colored block is one token. Hover a token to see its ID. Spacing, punctuation, and capitalization all affect tokenization.
Live stats update as you type: characters, tokens, words, tokens per word, average token length, and unique tokens.
Token visualization (hover for ID)
Why tokens instead of characters or words?
Character-level models need very long sequences to represent even short texts. Word-level models need huge vocabularies and cannot handle words they have never seen. BPE finds a middle ground: common words get single tokens, while rare words are split into familiar subwords. "unforgettable" might become ["un", "forget", "table"].
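To make the split concrete, here is a toy subword splitter. Note the hedge: real BPE applies learned merge rules in order, while this simplified greedy longest-match scheme (closer to WordPiece) just illustrates how a rare word decomposes into familiar pieces. The vocabulary below is made up for the example.

```python
# Toy vocabulary for illustration only (a real one has tens of thousands of entries).
VOCAB = {"un", "forget", "table", "for", "get"}

def split_subwords(word: str) -> list[str]:
    """Greedy longest-match subword split against VOCAB."""
    pieces, i = [], 0
    while i < len(word):
        # Take the longest vocabulary entry that matches at position i.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Fallback: emit a single character when nothing matches.
            pieces.append(word[i])
            i += 1
    return pieces

print(split_subwords("unforgettable"))  # → ['un', 'forget', 'table']
```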
Token Types
Modern tokenizers produce several distinct token types. Understanding them helps you predict how text will be chunked and why token counts vary.
BPE (Byte Pair Encoding) starts with individual characters, then iteratively merges the most frequent adjacent pairs. After thousands of iterations, common words become single tokens while rare/new words are handled as subword sequences.
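The merge loop described above can be sketched in a few lines of Python. This is a minimal training sketch on a tiny made-up corpus, not a production tokenizer (real implementations work on bytes, use pre-tokenization, and are heavily optimized):

```python
from collections import Counter

def learn_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merges: repeatedly fuse the most frequent adjacent pair."""
    words = [list(w) for w in corpus]  # each word starts as characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the winning merge everywhere it occurs.
        new_words = []
        for w in words:
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    out.append(w[i] + w[i + 1])
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(out)
        words = new_words
    return merges

corpus = ["low", "low", "lower", "lowest", "newer", "newest"]
merges = learn_bpe(corpus, 4)
print(merges)  # first merges fuse 'l'+'o', then 'lo'+'w', building up "low"
```

After enough merges, frequent whole words like "low" become single tokens, while "newest" stays a subword sequence.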
GPT-2 and GPT-3 use a vocabulary of roughly 50,000 tokens (50,257, to be exact); SentencePiece-based models vary from 32K to 256K entries.
Context Window Visualizer
Every LLM has a maximum context window — the number of tokens it can "see" at once. Drag the slider to fill the window and see what gets cut off.
Example state: with a 4,096-token window and 3,000 tokens used, 1,096 tokens remain and nothing is truncated yet. In the token grid, purple = used, dark = available, red = truncated.
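The truncation the visualizer shows can be sketched as a simple slice over token IDs. The left-truncation policy (drop the oldest tokens first, as chat interfaces typically do) is an assumption for this sketch; systems differ in what they cut.

```python
def fit_to_window(prompt_tokens: list[int], window: int = 4096,
                  reserve_for_output: int = 0) -> list[int]:
    """Keep the most recent tokens that fit the context budget.
    Assumed policy: left-truncation (oldest tokens dropped first)."""
    budget = window - reserve_for_output
    if len(prompt_tokens) <= budget:
        return prompt_tokens
    return prompt_tokens[-budget:]

tokens = list(range(5000))        # pretend prompt of 5,000 token IDs
kept = fit_to_window(tokens, window=4096)
print(len(kept), kept[0])         # 4096 904 — the first 904 tokens were cut
```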
BPE Vocabulary Sample
A sample of tokens from a typical BPE vocabulary. Notice: single characters, common words, common subwords, numbers, punctuation, and special tokens all coexist.
How token ID assignment works: Tokens are ranked by frequency during training. IDs 0–255 are usually reserved as byte fallbacks. Common English words get low IDs. Special tokens like [CLS], [SEP], and <|endoftext|> are assigned fixed IDs.
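The layout just described can be sketched as a vocabulary builder. The byte-token naming (`<0x41>`-style) and the exact placement of special tokens vary between tokenizers; this follows the layout the text describes and is illustrative, not any specific library's scheme.

```python
def build_vocab(ranked_subwords: list[str], special: list[str]) -> dict[str, int]:
    """Assign IDs: byte fallbacks 0-255, then frequency-ranked subwords,
    then special tokens at fixed trailing positions."""
    vocab = {f"<0x{b:02X}>": b for b in range(256)}       # byte fallbacks
    for i, tok in enumerate(ranked_subwords):             # frequency-rank order
        vocab[tok] = 256 + i
    for j, tok in enumerate(special):                     # fixed trailing IDs
        vocab[tok] = 256 + len(ranked_subwords) + j
    return vocab

vocab = build_vocab(["the", " the", "ing", " and"], ["<|endoftext|>"])
print(vocab["the"], vocab["<|endoftext|>"])  # 256 260
```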