Cloudflare has put a custom LLM inference engine into production across its global network. The engine is called Infire, and the architectural choice underneath it is prefill/decode disaggregation: splitting input processing and output generation onto separate machines, each optimized for its stage. The result is that Cloudflare is now hosting trillion-parameter open models like Kimi K2.5 (1T+ params, ~560GB on disk) at the edge, alongside Llama 4 Scout. The interesting part isn't the launch; it's that one of the largest CDNs has joined the small set of operators running their own non-vLLM, non-SGLang inference stack at scale.
The P/D split is the load-bearing architectural choice. Prefill is compute-bound: it processes the input prompt and populates the KV cache. Decode is memory-bound: it re-reads the growing KV cache and emits one token at a time. Put both stages on the same machine and you provision for whichever stage is the current bottleneck while the hardware serving the other sits underutilized. Infire separates the two onto machines optimized for each profile. On top of that, Infire combines pipeline parallelism (sharding across GPUs by model layer) with tensor parallelism (sharding within layers by tensor), with the explicit goal of keeping GPUs at one stage from starving while another stage executes. The hardware footprints are concrete, and the arithmetic is tight: Kimi K2.5 needs at least 8 H100s, which is 8 × 80GB = 640GB of HBM; with ~560GB of weights, roughly 80GB remains for KV cache. Llama 4 Scout fits on 2 H200s with substantial context capacity left.
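Infire's internals aren't public beyond the announcement, so here is the shape of the pattern rather than its implementation: a minimal sketch assuming a queue-based handoff, with every name (`PrefillWorker`, `DecodeWorker`, `KVHandoff`) hypothetical and dummy arrays standing in for the model.

```python
from dataclasses import dataclass
from queue import Queue

import numpy as np

@dataclass
class KVHandoff:
    """What prefill ships to decode: the populated KV cache plus cursor state."""
    request_id: int
    kv_cache: np.ndarray  # [layers, seq_len, d]; grows one position per decode step
    last_token: int

class PrefillWorker:
    """Compute-bound stage: one dense pass over the whole prompt.
    In production this runs on hardware sized for matmul throughput."""
    def __init__(self, handoff: "Queue[KVHandoff]"):
        self.handoff = handoff

    def run(self, request_id: int, prompt_tokens: list[int]) -> None:
        # Stand-in for the forward pass that populates the KV cache.
        kv = np.random.randn(8, len(prompt_tokens), 128)  # 8 layers, d=128
        self.handoff.put(KVHandoff(request_id, kv, prompt_tokens[-1]))

class DecodeWorker:
    """Memory-bound stage: re-reads the whole KV cache for every emitted token.
    In production this runs on hardware sized for HBM bandwidth."""
    def __init__(self, handoff: "Queue[KVHandoff]"):
        self.handoff = handoff

    def run(self, max_new_tokens: int) -> list[int]:
        job = self.handoff.get()
        out = []
        for _ in range(max_new_tokens):
            # Stand-in for one decode step: attend over the cache, emit one
            # token, then append that token's K/V so the cache grows by one.
            next_token = int(abs(job.kv_cache.sum() * 1e3)) % 50_000
            job.kv_cache = np.concatenate(
                [job.kv_cache, np.random.randn(8, 1, 128)], axis=1)
            out.append(next_token)
        return out

handoff: "Queue[KVHandoff]" = Queue()
PrefillWorker(handoff).run(request_id=0, prompt_tokens=[1, 2, 3, 4])
print(DecodeWorker(handoff).run(max_new_tokens=8))
```

In a real deployment the handoff is a KV-cache transfer across the network, which is the known cost of this pattern: the split only pays off when moving the cache is cheaper than the GPU idle time it eliminates.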
The second piece is Unweight, Cloudflare's weight-compression system, which shrinks model weights by 15-22% without accuracy loss and so reduces the amount of data moved across GPUs during inference. At the trillion-parameter scale, weight movement is a real cost dimension: every percentage point off the bytes-loaded number is real wattage and real latency. (A hedged sketch of why that range is plausible follows below.)

The bigger picture: Cloudflare is positioning to host frontier-scale open models as a generic infrastructure tier, the same way they host static assets. If Kimi K2.5 and Llama 4 Scout run on Cloudflare with credible cold-start and TTFT numbers, the cost-per-token math against renting your own H100 cluster shifts. The wrapper economy gets a new substrate, and "where do I run this 1T-param model" stops being a procurement project.
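On the Unweight number: Cloudflare hasn't published the mechanism, so treat this as one known technique that lands in a similar range rather than a description of Unweight. Splitting float weights into per-byte-position planes before entropy coding exploits how tightly exponent bits cluster in trained weights; the sketch below demonstrates the effect with zlib on a synthetic float16 tensor, and everything it prints is illustrative.

```python
# Hedged sketch: byte-plane splitting + entropy coding, one known route to
# double-digit *lossless* compression of float weights. Not a claim about
# how Unweight actually works.
import zlib

import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a trained weight tensor: roughly Gaussian, small scale.
w = (rng.standard_normal(1 << 20) * 0.02).astype(np.float16)
raw = w.tobytes()

# Baseline: compress the interleaved bytes as-is.
baseline = len(zlib.compress(raw, level=9))

# Byte-plane transpose: plane 0 = low (mantissa) bytes, near-incompressible;
# plane 1 = sign + exponent + top mantissa bits, where exponents cluster.
planes = np.frombuffer(raw, dtype=np.uint8).reshape(-1, 2).T
transposed = sum(len(zlib.compress(p.tobytes(), level=9)) for p in planes)

print(f"original:    {len(raw)} bytes")
print(f"interleaved: {baseline} bytes ({1 - baseline / len(raw):.1%} saved)")
print(f"byte-plane:  {transposed} bytes ({1 - transposed / len(raw):.1%} saved)")
```

Decompression reverses the transform bit-exactly, which is why "without accuracy loss" is the easy part of such a claim; the engineering is doing it fast enough in the weight-load path to save latency rather than add it.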
If you're shipping with open-weight frontier models and don't want to operate a GPU pool, Workers AI / Infire is now in a different competitive bracket than it was a year ago. Test the same workload there versus your current provider, with TTFT and per-token cost as the meaningful comparison, especially for long-context coding-agent traces; a minimal harness for that is sketched below. If you operate your own inference stack, the P/D disaggregation pattern is the takeaway, and pipeline plus tensor parallelism in tandem (rather than picking one) is the implementation note. Unweight isn't open as far as I can find, so weight compression remains a build-or-buy decision. The competitive pressure on vLLM and SGLang to stay best-in-class just got more real.
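TTFT falls out of a streaming request: the clock from send to first content chunk. A minimal sketch, assuming both providers expose an OpenAI-compatible streaming chat endpoint; the URL, model ID, and key are placeholders, chunk count only approximates token count, and per-token cost you fill in from each provider's pricing page.

```python
import json
import time
import urllib.request

def measure(url: str, api_key: str, model: str, prompt: str):
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 256,
    }).encode()
    req = urllib.request.Request(url, data=body, headers={
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    })
    start = time.monotonic()
    ttft, chunks = None, 0
    with urllib.request.urlopen(req) as resp:
        for line in resp:  # SSE stream: one "data: {...}" line per chunk
            if not line.startswith(b"data: ") or line.strip() == b"data: [DONE]":
                continue
            event = json.loads(line[len(b"data: "):])
            if not event.get("choices"):
                continue
            if event["choices"][0].get("delta", {}).get("content"):
                chunks += 1  # chunk count only approximates token count
                if ttft is None:
                    ttft = time.monotonic() - start
    total = time.monotonic() - start
    tps = (chunks - 1) / (total - ttft) if chunks > 1 and total > ttft else 0.0
    return ttft, tps

# Placeholders: point at each provider's OpenAI-compatible endpoint and compare.
ttft, tps = measure(
    url="https://your-provider.example/v1/chat/completions",
    api_key="YOUR_KEY",
    model="your-model-id",
    prompt="(paste a representative long-context coding-agent trace here)",
)
print(f"TTFT: {ttft:.2f}s, decode throughput: {tps:.1f} chunks/s")
```

Run it repeatedly per provider and per prompt length; cold-start versus warm behavior is exactly the dimension where an edge deployment should differentiate itself.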
