Tokens are created by a tokenizer, a separate algorithm that runs before the neural network ever sees your text. The most common approach today is Byte Pair Encoding (BPE), used by GPT, Claude, and Llama. BPE starts with individual characters (or bytes) and iteratively merges the most frequent pairs into new tokens. After enough merges, common words like "the" or "and" become single tokens, while rare or specialized words get split into subword pieces. The word "tokenization" itself might become "token" + "ization" or "token" + "iz" + "ation" depending on the specific tokenizer. This subword approach is what makes modern models handle misspellings, neologisms, and code reasonably well — they never encounter a truly "unknown" word, just unfamiliar combinations of known pieces.
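The merge loop described above can be sketched in a few lines. This is an illustrative toy, not a production tokenizer (real BPE implementations work on bytes, handle pre-tokenization, and train on large corpora), but it shows the core idea: count adjacent pairs, merge the most frequent, repeat.

```python
# Minimal sketch of BPE training: repeatedly merge the most frequent
# adjacent pair of symbols into a new token. Illustrative only.
from collections import Counter

def bpe_train(text, num_merges):
    tokens = list(text)                       # start from individual characters
    merges = []                               # learned merge rules, in order
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)      # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])   # apply the merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens

merges, tokens = bpe_train("the theme and the thesis", num_merges=2)
print(tokens)   # the frequent piece "the" has coalesced into a single token
```

After just two merges on this tiny corpus, "t"+"h" and then "th"+"e" are fused, so every occurrence of "the" (standalone or inside "theme" and "thesis") becomes one token.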
Different models use different tokenizers with different vocabularies, and this matters more than most people realize. GPT-4's tokenizer (cl100k) has around 100,000 token types. Claude's tokenizer is different. Llama uses yet another. The same English sentence can tokenize to a different number of tokens depending on which model you are using, which directly affects context window usage and API costs. Code tends to be less token-efficient than prose because variable names and syntax tokens may not appear frequently enough in training data to earn their own vocabulary entry. Non-English languages vary wildly — languages with Latin scripts generally tokenize almost as efficiently as English, but Chinese, Japanese, Korean, Arabic, and Hindi often require more tokens per equivalent meaning because their characters may not have been as heavily represented during tokenizer training.
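The effect of vocabulary choice on token counts can be demonstrated with two made-up vocabularies and a greedy longest-match tokenizer. Real tokenizers (tiktoken, SentencePiece, etc.) are more sophisticated, but the punchline — the same text yields different token counts under different vocabularies — is the same.

```python
# Illustrative only: two toy vocabularies tokenizing the same word.
# vocab_a and vocab_b are invented for this demo.

def greedy_tokenize(text, vocab, max_piece_len=10):
    """Greedy longest-match: always take the longest vocab entry that fits."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(len(text) - i, max_piece_len), 0, -1):
            piece = text[i:i + length]
            if piece in vocab or length == 1:   # single chars always allowed
                tokens.append(piece)
                i += length
                break
    return tokens

vocab_a = {"token", "ization", "the", " "}
vocab_b = {"tok", "en", "iz", "ation", "the", " "}

print(greedy_tokenize("tokenization", vocab_a))  # ['token', 'ization']
print(greedy_tokenize("tokenization", vocab_b))  # ['tok', 'en', 'iz', 'ation']
```

Two tokens versus four for the identical word — and at API scale, that difference compounds directly into context usage and cost.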
The tokenizer's vocabulary size creates a real engineering trade-off. A larger vocabulary means common words and phrases get their own dedicated tokens, so your text compresses into fewer tokens (cheaper, faster, fits more in context). But a larger vocabulary also means a bigger embedding table at the model's input and output layers, which increases model size and memory usage. The embedding table for a vocabulary of 100,000 tokens at a model dimension of 4,096 is already roughly 410 million parameters (100,000 × 4,096) — a nontrivial chunk of a smaller model. This is why vocabulary sizes tend to cluster in the 32K–128K range: it is the sweet spot between compression efficiency and parameter overhead.
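The arithmetic is simple enough to check directly. A small sketch (the `tied` flag reflects that some architectures share one table between the input and output layers, while others keep two separate copies):

```python
# Back-of-the-envelope parameter cost of the embedding table:
# vocab_size x d_model weights, doubled if the input embedding and the
# output (unembedding) projection are not tied.

def embedding_params(vocab_size, d_model, tied=True):
    table = vocab_size * d_model
    return table if tied else 2 * table

print(f"{embedding_params(100_000, 4_096):,}")              # 409,600,000
print(f"{embedding_params(100_000, 4_096, tied=False):,}")  # 819,200,000
```

At 100K vocabulary and d_model of 4,096, untied embeddings would cost over 800 million parameters — which is why smaller models in particular tend toward smaller vocabularies or tied embeddings.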
When providers advertise context windows — 8K, 128K, 1M tokens — those numbers include everything: your system prompt, your conversation history, any documents you paste in, and the model's own response. A common developer mistake is stuffing the context window full of reference material and leaving too few tokens for the model to generate a substantive reply. Most APIs let you set a max_tokens parameter for the response, but if your input already consumed most of the context window, the response may be cut off mid-thought or the request rejected outright. In practice, you want to budget: know your model's context limit, estimate your input size (a rough rule of thumb is one token per 3/4 of an English word — for precision, use the provider's tokenizer library), and reserve enough room for the output you need.
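That budgeting step can be a simple pre-flight check. A sketch, with two caveats: the ~4 characters-per-token estimate is a rough heuristic for English prose (for exact counts, use the provider's tokenizer library), and the context limit and output reserve below are illustrative numbers, not any particular provider's.

```python
# Rough pre-flight token budget check before sending a request.
# The chars-per-token ratio and the limits are illustrative assumptions.

def estimate_tokens(text):
    return max(1, len(text) // 4)        # ~4 chars per token for English prose

def remaining_output_budget(prompt, context_limit):
    """Tokens left for the model's response after the input is counted."""
    return context_limit - estimate_tokens(prompt)

prompt = "System instructions... " + "reference document text " * 1200
budget = remaining_output_budget(prompt, context_limit=8_192)
if budget < 1_024:                       # reserve room for a substantive reply
    print(f"only {budget} tokens left for output; trim the input")
```

The check costs nothing and catches the exact failure mode described above: an input so large that the answer you actually wanted gets squeezed out.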
There is also a cost dimension most people underestimate. Output tokens are typically 3–5x more expensive than input tokens on API pricing tiers, because generating each output token requires a full forward pass through the model, while input tokens can be processed in parallel. This asymmetry means that a chatbot giving long, verbose answers costs dramatically more than one trained to be concise. It is also why techniques like prompt caching (reusing the processed input tokens across multiple requests) can cut costs significantly for applications that share a common system prompt or document context across many queries. Understanding the token economics is not just academic — it is the difference between an AI feature that costs $50/month to run and one that costs $5,000.
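The asymmetry is easy to see with a small cost model. The prices below are hypothetical placeholders chosen to match the 3–5x ratio described above — check your provider's actual pricing before relying on any numbers.

```python
# Illustrative cost math with made-up prices ($ per million tokens).
# The 3.0 / 15.0 figures are hypothetical, chosen only to reflect the
# typical input/output price asymmetry.

def request_cost(in_tokens, out_tokens, in_price_per_m=3.0, out_price_per_m=15.0):
    """Dollar cost of one request under per-million-token pricing."""
    return (in_tokens / 1e6) * in_price_per_m + (out_tokens / 1e6) * out_price_per_m

# Same 10,000 total tokens, opposite splits:
concise = request_cost(in_tokens=9_000, out_tokens=1_000)   # $0.042
verbose = request_cost(in_tokens=1_000, out_tokens=9_000)   # $0.138
print(f"concise: ${concise:.3f}  verbose: ${verbose:.3f}")
```

Identical total token counts, but the output-heavy request costs over three times as much — which is exactly why verbose chatbots and output-heavy workloads dominate the bill, and why prompt caching on the (cheaper, reusable) input side pays off.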