A single H100 GPU running Llama 70B hits 92% utilization during prefill, then crashes to 28% during decode, all within the same request. This isn't a bug; it's a fundamental mismatch between how LLMs work and how we deploy them. Prefill processes the entire prompt in parallel through massive matrix multiplications that saturate the tensor cores. Decode generates tokens one at a time through memory-bound weight and KV-cache reads that barely touch compute. Yet most teams run both phases on identical GPU pools, paying for 64 H100s while getting meaningful work from maybe 20.
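A back-of-envelope roofline calculation makes the mismatch concrete. The sketch below uses public H100 SXM specs (~989 dense BF16 TFLOPS, ~3.35 TB/s HBM3) and the rough rule of 2 FLOPs per parameter per token; the exact numbers are assumptions, but the orders of magnitude are what matter.

```python
# Back-of-envelope arithmetic intensity for a 70B-parameter model in FP16.
# Specs below are approximate public H100 SXM numbers, not measurements.
PARAMS = 70e9
BYTES_PER_PARAM = 2           # FP16 weights
PEAK_FLOPS = 989e12           # ~dense BF16 throughput
PEAK_BW = 3.35e12             # ~HBM3 bandwidth, bytes/s

def arithmetic_intensity(batch_tokens):
    """FLOPs per byte of weight traffic for one forward pass.
    Roughly 2 FLOPs per parameter per token; the weights are read
    once regardless of how many tokens share the pass."""
    flops = 2 * PARAMS * batch_tokens
    bytes_moved = PARAMS * BYTES_PER_PARAM
    return flops / bytes_moved

# Below this ridge point the GPU is memory-bound, above it compute-bound.
ridge = PEAK_FLOPS / PEAK_BW            # ~295 FLOPs/byte

prefill = arithmetic_intensity(2048)    # whole 2048-token prompt at once
decode = arithmetic_intensity(1)        # one token per step

print(f"ridge point:          {ridge:>6.0f} FLOPs/byte")
print(f"prefill, 2048 tokens: {prefill:>6.0f} FLOPs/byte (compute-bound)")
print(f"decode, 1 token:      {decode:>6.0f} FLOPs/byte (memory-bound)")
```

Prefill sits roughly 7x above the ridge point and saturates the tensor cores; decode sits almost 300x below it, so the same GPU spends decode steps waiting on HBM.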

Disaggregated inference, pioneered in UC San Diego's 2024 DistServe paper, splits these workloads onto separate hardware optimized for each phase. The approach isn't theoretical: Perplexity runs it in production, Meta and LinkedIn serve traffic through it, and NVIDIA built its Dynamo framework around it. vLLM, SGLang, and TensorRT-LLM all support disaggregation natively. The promise is a 2-4x cost reduction from right-sizing compute for actual workload requirements instead of worst-case scenarios.
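The request path is the same across these systems: a prefill pool runs one parallel pass over the prompt, the resulting KV cache is shipped to a decode pool (over NVLink or RDMA in practice), and tokens stream from there. Here's a minimal sketch of that flow; the class names, the dummy "model," and the cache handle are all illustrative, not any framework's real API.

```python
# Hypothetical sketch of a disaggregated request path. PrefillWorker and
# DecodeWorker are made-up names; the KV "transfer" is just a dict insert.
from dataclasses import dataclass

@dataclass
class KVCache:
    request_id: str
    num_tokens: int                 # stand-in for per-layer key/value tensors

class PrefillWorker:
    def prefill(self, request_id, prompt_tokens):
        # One parallel pass over the whole prompt: compute-bound,
        # so this pool is sized for FLOPs.
        return KVCache(request_id, len(prompt_tokens)), prompt_tokens[-1]

class DecodeWorker:
    def __init__(self):
        self.sessions = {}

    def attach(self, kv):
        # In production this step is a KV-cache transfer over NVLink/RDMA.
        self.sessions[kv.request_id] = kv

    def decode(self, request_id, last_token, max_new_tokens):
        kv = self.sessions[request_id]
        out = []
        for _ in range(max_new_tokens):
            # One token per step: memory-bandwidth-bound, so this pool
            # is sized for HBM throughput rather than FLOPs.
            last_token = (last_token + kv.num_tokens) % 50257  # dummy "model"
            out.append(last_token)
            kv.num_tokens += 1
        return out

prefill_pool, decode_pool = PrefillWorker(), DecodeWorker()
kv, last = prefill_pool.prefill("req-1", [101, 102, 103])
decode_pool.attach(kv)           # hand off the cache between pools
tokens = decode_pool.decode("req-1", last, max_new_tokens=4)
print(tokens)
```

The design point worth noticing is the handoff: the KV cache is the only state that crosses pools, which is what lets each pool be sized and scheduled independently.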

The broader inference-optimization landscape shows this architectural shift gaining momentum beyond academic papers. While I covered Cursor's Warp Decode claims of 1.8x speedups back in April, which lacked concrete proof, disaggregated inference delivers measurable cost improvements with production deployments you can actually verify. The LLM Inference Handbook notes that collocating prefill and decode creates scheduling conflicts: compute-heavy prefill blocks memory-bound decode, increasing both time-to-first-token and inter-token latency.
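A toy timeline shows why the blocking hurts. Assume (made-up, round numbers) that a long-prompt prefill pass costs 400 ms and a decode step costs 30 ms on the same GPU; when a new request's prefill lands between two decode steps of an in-flight stream, that stream's inter-token gap jumps from 30 ms to 430 ms.

```python
# Toy single-GPU timeline with assumed costs: a collocated prefill pass
# stalls an in-flight decode stream and inflates one inter-token gap.
PREFILL_MS = 400   # assumed cost of one long-prompt prefill pass
DECODE_MS = 30     # assumed cost of one decode step

def inter_token_gaps(schedule):
    """Given a list of 'prefill'/'decode' slots run on one GPU,
    return the wall-clock gap preceding each emitted token."""
    gaps, since_last_token = [], 0
    for slot in schedule:
        if slot == "prefill":
            since_last_token += PREFILL_MS   # decode stream is blocked
        else:
            since_last_token += DECODE_MS
            gaps.append(since_last_token)    # a token is emitted
            since_last_token = 0
    return gaps

# Collocated: a new request's prefill interleaves with an ongoing decode.
collocated = inter_token_gaps(["decode", "decode", "prefill", "decode", "decode"])
# Disaggregated: the prefill ran on its own pool, decode is undisturbed.
disaggregated = inter_token_gaps(["decode"] * 4)

print(collocated)     # [30, 30, 430, 30]
print(disaggregated)  # [30, 30, 30, 30]
```

Averages hide this: both schedules emit four tokens in well under a second, but the collocated one has a 430 ms stall that users perceive as the stream freezing.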

For developers running inference at scale, disaggregation requires rethinking your deployment architecture but offers real cost savings. If you're burning through H100 budgets on inference workloads, the hardware utilization mismatch is probably costing you more than the engineering effort to implement separate prefill and decode clusters.