Rate limiting in AI APIs operates on multiple dimensions simultaneously, and understanding each one prevents a lot of frustration. Most providers enforce at least two limits: requests per minute (RPM) and tokens per minute (TPM). RPM caps how many API calls you can make regardless of size — Anthropic's free tier might allow 5 RPM, while paid tiers offer 1,000+ RPM. TPM caps the total volume of tokens (input + output) flowing through per minute. You can hit either limit independently. A common surprise: you're well under your RPM limit but hitting TPM because you're sending long prompts with large context windows. Some providers also enforce requests per day (RPD) and tokens per day (TPD), creating a daily ceiling that resets at midnight UTC.
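To see how the two limits interact, here's a small sketch with hypothetical tier numbers (illustrative only, not any provider's actual limits): the sustainable request rate is whichever of the two ceilings you hit first.

```python
# Hypothetical tier limits — illustrative numbers only.
RPM_LIMIT = 1_000    # requests per minute
TPM_LIMIT = 80_000   # tokens per minute (input + output)

def max_sustainable_rpm(tokens_per_request: int) -> int:
    """Requests per minute you can actually sustain, given both ceilings."""
    return min(RPM_LIMIT, TPM_LIMIT // tokens_per_request)

# Short round trips (~50 tokens) are bounded by RPM:
print(max_sustainable_rpm(50))       # 1000 — RPM is the binding limit

# Long prompts with big contexts (~40k tokens) hit TPM far sooner:
print(max_sustainable_rpm(40_000))   # 2 — TPM is the binding limit
```

This is the "common surprise" in numbers: at 40,000 tokens per request you exhaust the token budget after just two calls a minute, while nominally having 998 requests of RPM headroom left.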
The mechanics of how providers enforce these limits follow a few standard patterns. The most common is the token bucket algorithm (or its close cousin, the sliding window). Imagine a bucket that holds, say, 60 tokens' worth of capacity and refills at a rate of one unit per second. Each request drains the bucket in proportion to its token count; if the bucket is empty, the request gets rejected with an HTTP 429 (Too Many Requests). The response headers tell you what you need to know. OpenAI exposes x-ratelimit-limit-requests, x-ratelimit-remaining-requests, x-ratelimit-reset-requests, and their token equivalents; Anthropic follows the same pattern under anthropic-ratelimit-* names (e.g. anthropic-ratelimit-requests-remaining). Smart client code reads these headers proactively rather than waiting to get 429'd, and most providers include them on every response.
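The bucket mechanics above can be sketched in a few lines. This is a minimal client-side version of the same idea, not any provider's actual implementation; returning False is the local analog of receiving a 429.

```python
import time

class TokenBucket:
    """Minimal token bucket: `capacity` units, refilled at `rate` units/second."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.level = capacity          # start full
        self.last = time.monotonic()

    def try_acquire(self, cost: float) -> bool:
        now = time.monotonic()
        # Refill in proportion to elapsed time, capped at capacity.
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate)
        self.last = now
        if self.level >= cost:
            self.level -= cost         # request drains the bucket by its token cost
            return True
        return False                   # bucket empty: the client-side 429

bucket = TokenBucket(capacity=60, rate=1.0)
bucket.try_acquire(50)   # succeeds: 60 -> 10 units left
bucket.try_acquire(50)   # fails: only ~10 units remain until the bucket refills
```

The same structure works whether a "unit" is one request (enforcing RPM) or one token (enforcing TPM); providers run both buckets side by side.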
When you do get rate-limited, the standard approach is exponential backoff with jitter. Wait 1 second after the first 429, then 2 seconds, then 4, then 8 — and add a random component (jitter) so that if 50 of your parallel workers all got 429'd at the same time, they don't all retry at the exact same moment and immediately get 429'd again. Most provider SDKs (Anthropic's Python SDK, OpenAI's SDK) handle basic retry logic automatically, but production systems usually need more sophisticated approaches: request queues with priority levels, adaptive rate limiting that throttles proactively based on remaining quota, and circuit breakers that fail fast when a provider is clearly overloaded rather than piling on more retries.
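A bare-bones version of backoff with full jitter might look like the following. The RateLimitError class here is a stand-in for whatever your SDK raises on a 429 (the Anthropic and OpenAI Python SDKs each export their own RateLimitError), and make_request is any callable you supply.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's 429 exception (e.g. anthropic.RateLimitError)."""

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Retry `make_request` on 429s with exponential backoff plus full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return make_request()
        except RateLimitError:
            if attempt == max_retries:
                raise  # out of retries: surface the error to the caller
            # Full jitter: sleep a random amount up to the exponential cap
            # (1s, 2s, 4s, 8s, ...) so parallel workers don't retry in lockstep.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

Sleeping a uniform random amount up to the cap ("full jitter"), rather than the cap plus a small random offset, spreads a burst of simultaneous retries most evenly, which is exactly the 50-workers-retrying-at-once problem described above.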
The strategic implications of rate limits shape how serious applications are architected. If you need to process 100,000 documents through Claude, you can't just fire off 100,000 concurrent API calls. You need to manage concurrency, probably running 20-50 parallel requests and feeding them from a queue. Anthropic offers a Batch API with a separate, higher-throughput rate limit at a 50% cost discount — specifically designed for this use case. OpenAI has a similar batch endpoint. For applications that need guaranteed capacity, enterprise tiers and committed-use agreements offer dedicated throughput that's shielded from the shared pool. The unspoken reality is that rate limits aren't just about fairness — they're about GPU allocation. Every request you make requires GPU time, and providers can only serve as many concurrent requests as they have GPUs for. Rate limits are the mechanism that keeps supply and demand in balance.
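The queue-fed concurrency pattern can be sketched with an asyncio semaphore. Here call_model is a placeholder for your actual async API call (hypothetical signature); the semaphore caps how many requests are in flight at once, and the rest of the queue simply waits its turn.

```python
import asyncio

async def process_all(documents, call_model, worker_limit=20):
    """Run `call_model(doc)` over all documents with at most
    `worker_limit` requests in flight at any moment."""
    sem = asyncio.Semaphore(worker_limit)

    async def process_one(doc):
        async with sem:  # blocks while worker_limit calls are already in flight
            return await call_model(doc)

    # Results come back in input order, even though completion order varies.
    return await asyncio.gather(*(process_one(d) for d in documents))
```

In practice you would combine this with the backoff logic above and tune worker_limit against your tier's limits; the semaphore handles the "20-50 parallel requests fed from a queue" shape, but it won't save you from 429s on its own.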