LLM serving has been a one-rack problem for a reason. Prefilling a 32K-token request on an 8×H200 instance generates KV cache at about 60 Gbps. That number forces you to keep prefill and decode on the same rack, or at a stretch the same datacenter, because only RDMA fabric can move that kind of traffic without choking. The cost is that you pay for dense compute and high-bandwidth memory in the same GPU, even though the decode phase mostly needs the latter. Moonshot AI and Tsinghua, in their PrfaaS paper (arXiv:2604.15039), argue the colocation assumption is worth revisiting.
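To see where a number like 60 Gbps comes from, a back-of-envelope sketch is enough. The layer count, KV heads, head dimension, and prefill time below are illustrative assumptions, not figures from the paper; they just land in the right order of magnitude.

```python
# Back-of-envelope: KV-cache egress rate for one long prefill request.
# Every model dimension and the prefill time are illustrative assumptions,
# not figures from the PrfaaS paper.

def kv_egress_gbps(
    tokens: int = 32_000,          # prefill length
    layers: int = 80,              # transformer layers (assumed)
    kv_heads: int = 8,             # grouped-query KV heads (assumed)
    head_dim: int = 128,           # per-head dimension (assumed)
    bytes_per_elem: int = 2,       # bf16/fp16 KV cache
    prefill_seconds: float = 1.5,  # assumed wall-clock prefill on 8xH200
) -> float:
    """Bits of KV cache produced per second of prefill, in Gbps."""
    # K and V each store layers * kv_heads * head_dim elements per token.
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    total_bits = tokens * bytes_per_token * 8
    return total_bits / prefill_seconds / 1e9


if __name__ == "__main__":
    # With these assumptions: ~0.33 MB of KV per token, ~10.5 GB per request,
    # which works out to roughly 56 Gbps of egress during prefill.
    print(f"{kv_egress_gbps():.1f} Gbps")
```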
PrfaaS, short for prefill-as-a-service, sends long prefills to dedicated H200 clusters, then ships the resulting KV cache over commodity Ethernet to local decode clusters running cheaper H20 GPUs. Three things make the transfer tractable. First, hybrid attention: Ring-2.5-1T and MiMo-V2-Flash compress KV state roughly 36× versus dense-attention equivalents, which drops per-request cache egress from about 60 Gbps to about 5 Gbps. Second, layer-wise pipelining overlaps generation with transmission, so transfer begins before prefill finishes. Third, multi-connection TCP with active congestion monitoring extracts whatever throughput the VPC path actually offers (about 13 Gbps sustained on a nominally 100 Gbps link). Routing is length-based: requests under 19.4K tokens stay local, and longer ones route to remote prefill, where the compute savings justify the round-trip.
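To make the pipelining idea concrete, here is a minimal sketch of streaming per-layer KV blocks over a small pool of TCP connections while later layers are still being computed. The endpoint, framing, and per-layer sizes are assumptions on my part; the paper does not publish its wire protocol, and a production version would serialize real tensors and handle backpressure.

```python
# Minimal sketch of layer-wise KV pipelining over a pool of TCP connections.
# Endpoints, framing, and per-layer sizes are illustrative assumptions; this
# is not the paper's protocol, just the shape of the overlap it describes.

import socket
import struct
import threading
import time
from concurrent.futures import ThreadPoolExecutor

NUM_CONNS = 4                      # multi-connection TCP, one worker each
NUM_LAYERS = 8                     # stand-in for the model's layer count
LAYER_KV_BYTES = 4 * 1024 * 1024   # assumed serialized KV size per layer
ADDR = ("127.0.0.1", 9009)


def recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a TCP stream (or less if the peer closed)."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf


def decode_side_listener() -> None:
    """Stand-in for the decode cluster: accept connections and drain frames."""
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(ADDR)
    srv.listen(NUM_CONNS)

    def drain(conn: socket.socket) -> None:
        while True:
            header = recv_exact(conn, 8)
            if len(header) < 8:
                return
            layer_id, size = struct.unpack("!II", header)
            recv_exact(conn, size)  # a real receiver would place this into KV memory

    for _ in range(NUM_CONNS):
        conn, _ = srv.accept()
        threading.Thread(target=drain, args=(conn,), daemon=True).start()


def send_layer(conn: socket.socket, layer_id: int, payload: bytes) -> None:
    """Frame one layer's KV block as a [layer_id, length] header plus payload."""
    conn.sendall(struct.pack("!II", layer_id, len(payload)) + payload)


def prefill_and_stream() -> None:
    conns = [socket.create_connection(ADDR) for _ in range(NUM_CONNS)]
    # One single-threaded pool per connection keeps frames on a socket ordered.
    pools = [ThreadPoolExecutor(max_workers=1) for _ in range(NUM_CONNS)]
    futures = []
    for layer in range(NUM_LAYERS):
        time.sleep(0.05)                  # stand-in for computing this layer's prefill
        kv_block = bytes(LAYER_KV_BYTES)  # stand-in for this layer's serialized KV
        # Ship layer i while layer i+1 is still being computed.
        futures.append(pools[layer % NUM_CONNS].submit(
            send_layer, conns[layer % NUM_CONNS], layer, kv_block))
    for f in futures:
        f.result()
    for c in conns:
        c.close()


if __name__ == "__main__":
    threading.Thread(target=decode_side_listener, daemon=True).start()
    time.sleep(0.2)  # give the listener a moment to bind
    prefill_and_stream()
    print(f"streamed {NUM_LAYERS} layers over {NUM_CONNS} connections")
```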
The case-study numbers are hard to ignore. On 32 H200 prefill plus 64 H20 decode, PrfaaS hits 54% higher throughput than a homogeneous H200 baseline and 32% higher than a naive heterogeneous setup. Mean TTFT drops 50%, P90 TTFT drops 64%. Extrapolated to a 10,000-GPU datacenter, the design implies about 1.8 Tbps of aggregate cross-cluster bandwidth. The architectural argument is bigger than the benchmarks: geo-distributed LLM serving has been blocked on KV transport, and if hybrid attention plus pipelined TCP is enough to move past that, the design space for where your prefill GPUs live suddenly opens up. Prefill in one region, decode in another, cheaper silicon everywhere.
If you run long-context inference at scale, two things from this paper are worth measuring against your own stack. One, the 19.4K-token routing threshold isn't magic; it is the point where PrfaaS's specific compute differential pays off, and your number will be different. Two, the hybrid attention compression ratios depend entirely on which model family you're serving; dense-attention models don't get 36× for free. But the broader claim, that commodity Ethernet is good enough for KV transport once you shrink the cache, is the kind of result that changes what "datacenter-bound" means for inference serving.
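If you want to find your own threshold, a toy break-even model is a reasonable place to start. Every number below (per-GPU prefill throughput, compressed KV size per token, sustained bandwidth, overlap fraction, fixed overhead) is a placeholder to be replaced with measurements from your own stack; none of them come from the paper.

```python
# Toy break-even model for a length-based routing threshold. Every number
# here is a placeholder to be swapped for measurements from your own stack;
# none of them come from the PrfaaS paper.

def remote_prefill_wins(
    tokens: int,
    local_prefill_tok_per_s: float = 6_000.0,    # decode-class GPUs (assumed)
    remote_prefill_tok_per_s: float = 24_000.0,  # prefill-class GPUs (assumed)
    kv_bytes_per_token: float = 40_000.0,        # post-compression KV (assumed)
    link_gbps: float = 13.0,                     # sustained cross-cluster TCP
    pipeline_overlap: float = 0.5,               # fraction of transfer hidden by prefill
    fixed_overhead_s: float = 1.5,               # scheduling, connections, queuing
) -> bool:
    """True if routing this request to remote prefill lowers its TTFT."""
    local_ttft = tokens / local_prefill_tok_per_s
    remote_compute = tokens / remote_prefill_tok_per_s
    transfer_s = tokens * kv_bytes_per_token * 8 / (link_gbps * 1e9)
    exposed_transfer_s = transfer_s * (1.0 - pipeline_overlap)
    return remote_compute + exposed_transfer_s + fixed_overhead_s < local_ttft


def routing_threshold(step: int = 1_000, limit: int = 128_000) -> int:
    """Smallest request length at which remote prefill starts to pay off."""
    for n in range(step, limit, step):
        if remote_prefill_wins(n):
            return n
    return limit


if __name__ == "__main__":
    # With these placeholders the crossover lands in the low tens of thousands
    # of tokens; the point is the shape of the model, not the exact number.
    print("route remotely above ~", routing_threshold(), "tokens")
```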
