The standard pricing unit for large language models is the token — roughly three-quarters of a word in English. When you send a message to an API like OpenAI's or Anthropic's, you're charged separately for input tokens (what you send) and output tokens (what the model generates). Output tokens cost more because they require sequential computation — the model has to generate them one at a time, which is slower and more GPU-intensive than processing input tokens in parallel. As of early 2026, prices for frontier models range from about $2–15 per million input tokens and $8–60 per million output tokens, depending on the provider and model tier. That might sound cheap until you realize that a busy application serving 100,000 users could easily consume billions of tokens per month.
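The arithmetic above is easy to make concrete. A minimal sketch, assuming illustrative mid-tier prices of $3 per million input tokens and $15 per million output tokens (placeholders for this example, not any provider's actual price sheet):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one API call, given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# One call: a 2,000-token prompt with a 500-token reply at $3/$15 per million.
per_call = request_cost(2_000, 500, 3.0, 15.0)   # $0.0135

# At scale: 100,000 users making 10 such calls a day for 30 days
# works out to roughly a billion tokens and a six-figure monthly bill.
monthly = per_call * 100_000 * 10 * 30
```

Note how the output side dominates: the 500-token reply costs more here than the 2,000-token prompt, which is why trimming verbose responses often saves more than trimming prompts.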
AI pricing has fallen faster than almost anyone predicted. OpenAI's GPT-3.5 launched in early 2023 at $2 per million tokens; by mid-2024, models of equivalent quality were available for $0.10–0.25 per million tokens from providers like DeepSeek, Mistral, and Google (via Gemini Flash). This roughly 10–50x price reduction in 18 months came from three converging forces: hardware improvements (H100s are ~3x more efficient than A100s for inference), software optimizations (continuous batching, speculative decoding, and quantization), and competitive pressure (DeepSeek's open-weight models forced commercial providers to cut margins). The pattern continues — each new generation of inference chips and serving frameworks pushes costs lower. For developers, this means the model that was too expensive for your use case six months ago might be affordable today.
Not everything fits neatly into per-token pricing. Image generation models like DALL-E and Stable Diffusion charge per image (typically $0.02–0.08 per image depending on resolution). Video models charge per second of generated video — Runway's Gen-3 runs about $0.05 per second, which adds up fast for longer clips. Speech models charge per character or per minute of audio. Embedding models charge per token but at far lower rates than generative models (often $0.01–0.10 per million tokens). Some providers offer subscription models: ChatGPT Plus at $20/month, Claude Pro at $20/month, giving users flat-rate access to the latest models, subject to rate limits rather than per-token billing. For enterprise customers, committed-use discounts — agreeing to spend $100K+ per year in exchange for 20–40% off list pricing — are standard. And several providers offer generous free tiers: Google's Gemini API, Mistral's La Plateforme, and Groq all let developers experiment for free up to certain usage thresholds.
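Comparing these per-unit prices side by side makes the differences tangible. A small estimator, using mid-range figures from the ranges quoted above (the specific numbers are illustrative picks, not quotes from any provider):

```python
# Rough per-unit prices, taken from the mid-range of the figures above.
UNIT_PRICES = {
    "image": 0.04,               # dollars per generated image
    "video_second": 0.05,        # dollars per second of generated video
    "embedding_m_tokens": 0.05,  # dollars per million embedding tokens
}

def estimate(kind: str, units: float) -> float:
    """Rough dollar cost for `units` of a given modality."""
    return UNIT_PRICES[kind] * units

one_minute_clip = estimate("video_second", 60)   # about $3.00
thousand_images = estimate("image", 1_000)       # about $40.00
embed_a_corpus = estimate("embedding_m_tokens", 100)  # 100M tokens, ~$5.00
```

The striking comparison: embedding a hundred-million-token corpus costs about as much as a single minute and a half of generated video.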
The single biggest lever for reducing AI costs isn't haggling with your provider — it's choosing the right model for the task. A frontier model like Claude Opus or GPT-4o is overkill for classification, extraction, or simple summarization; a smaller model like Claude Haiku, Gemini Flash, or Mistral Small can handle those tasks at 10–50x lower cost with comparable accuracy. Prompt engineering matters too: a system prompt that's 2,000 tokens long costs you those tokens on every single API call, so trimming it saves money at scale. Caching is another powerful tool — Anthropic's prompt caching and OpenAI's automatic caching both let you pay reduced rates for repeated context, which is especially valuable for applications that send the same system prompt or document context with every request. Finally, batching non-urgent requests (using OpenAI's Batch API or similar offerings) typically gives you a 50% discount in exchange for accepting higher latency.
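The system-prompt and caching points above lend themselves to a quick back-of-the-envelope check. A sketch, assuming $3 per million input tokens and treating the cache discount as a tunable parameter (the exact discount is provider-specific; cached reads at 10–50% of the base input rate is a common shape, stated here as an assumption):

```python
def system_prompt_cost(calls_per_month: int, prompt_tokens: int,
                       input_price_per_m: float,
                       cache_discount: float = 0.0) -> float:
    """Monthly dollars spent just re-sending the system prompt.

    cache_discount is the fraction knocked off cached input tokens
    (0.0 = no caching; 0.9 = cached reads billed at 10% of base rate).
    """
    effective_price = input_price_per_m * (1 - cache_discount)
    return calls_per_month * prompt_tokens * effective_price / 1_000_000

# 1M calls/month with a 2,000-token system prompt at $3/M input tokens:
uncached = system_prompt_cost(1_000_000, 2_000, 3.0)        # $6,000/month
cached = system_prompt_cost(1_000_000, 2_000, 3.0, 0.9)     # $600/month
```

The same function also quantifies the trimming advice: cutting that system prompt from 2,000 to 500 tokens saves three-quarters of the uncached figure before caching even enters the picture.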
Token pricing is the visible cost, but it's not the whole picture. Context window usage matters enormously: stuffing a 128K-token context window with documents on every call is technically possible but financially painful. Reasoning models like OpenAI's o1 and o3 generate internal "thinking" tokens that you pay for even though you never see them — a single complex query can consume 10,000+ thinking tokens on top of the visible response. Rate limits impose a hidden cost too: if your provider caps you at 1,000 requests per minute and your application needs 5,000, you either queue requests (adding latency) or provision multiple API keys (adding complexity). And don't forget egress costs, logging costs, and the engineering time spent building retry logic, token counting, and cost monitoring. The sticker price per token is just the beginning of the real cost equation.
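The thinking-token point is worth working through numerically, because reasoning models bill those hidden tokens at the output rate. A sketch, assuming illustrative prices of $2 per million input tokens and $60 per million output tokens (placeholders, not a real price sheet):

```python
def reasoning_query_cost(input_tokens: int, visible_output: int,
                         thinking_tokens: int,
                         input_price_per_m: float,
                         output_price_per_m: float) -> float:
    """Cost of one reasoning-model query; hidden thinking tokens
    are billed at the output rate even though they're never shown."""
    billed_output = visible_output + thinking_tokens
    return (input_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000

# 500 input tokens, an 800-token visible answer, 10,000 thinking tokens:
with_thinking = reasoning_query_cost(500, 800, 10_000, 2.0, 60.0)  # $0.649
without = reasoning_query_cost(500, 800, 0, 2.0, 60.0)             # $0.049
```

In this example the invisible reasoning tokens account for over 90% of the bill — which is exactly why the sticker price per token understates the real cost of complex queries.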