AMD has had competitive AI silicon for two years (MI300X with 192GB HBM3, MI325X with 256GB HBM3E, now MI355X with 288GB HBM3E and 8TB/s memory bandwidth on 4th-gen CDNA). The reason enterprises mostly haven't moved is that the software stack — ROCm, kernel coverage, vLLM/SGLang ports, scheduling — has lagged Nvidia's CUDA ecosystem by a margin big enough to erase the hardware advantage. The story now is that *neocloud* providers — TensorWave, MangoBoost, Crusoe — are closing that gap themselves rather than waiting for AMD or the open-source community, and the public proof points are starting to land.

The headline result: MangoBoost's LLMBoost software stack hit 103,182 tokens/sec offline on Llama2-70B in MLPerf Inference v5.0 across 32× MI300X (four 8-GPU nodes), versus the previous H100 record of 82,749 tokens/sec — about 25% higher throughput. They credit three things: multi-dimensional parallelism, dynamic scheduling across the 8 GPUs in each node, and a streamlined serving layer; on the same hardware, they claim the full stack runs 5.2-6.0× faster than plain vLLM. MangoBoost's own pricing math (caveat: their numbers, not independently audited) — MI300X at $15-17K vs H100 at $32-40K — works out to roughly 2.8× more inference throughput per dollar: the ~1.25× throughput edge multiplied by the ~2.25× price gap.

Meanwhile, TensorWave is among the first cloud providers deploying MI355X in production, and it already runs the largest AMD AI training cluster in North America: an 8,192× MI325X deployment under direct liquid cooling. MI355X cloud pricing across five providers (TensorWave, Crusoe, Vultr, and others) currently lands at $2.29-$8.60/hr per GPU.
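The 2.8× figure is easy to reproduce from the numbers above. Here's a back-of-envelope sketch, assuming both MLPerf records ran at comparable 32-GPU scale (the H100 submission's node count isn't given here) and taking the midpoints of the quoted price ranges:

```python
# Back-of-envelope check on the "~2.8x inference throughput per dollar" figure.
# Assumptions (ours, not MLPerf's): both records use 32 GPUs, and street price
# is taken at the midpoint of the quoted ranges.
mi300x_tps, h100_tps = 103_182, 82_749   # MLPerf v5.0 Llama2-70B offline, tokens/sec
gpus = 32                                # MangoBoost ran 4 nodes x 8 GPUs
mi300x_price = (15_000 + 17_000) / 2     # $/GPU, quoted range
h100_price = (32_000 + 40_000) / 2       # $/GPU, quoted range

mi300x_tps_per_dollar = mi300x_tps / (gpus * mi300x_price)
h100_tps_per_dollar = h100_tps / (gpus * h100_price)
print(f"MI300X advantage: {mi300x_tps_per_dollar / h100_tps_per_dollar:.2f}x")
# prints "MI300X advantage: 2.81x" -- ~1.25x throughput times ~2.25x price gap
```

Note that the street price contributes far more to the result than the benchmark delta does, so rerun this with your own negotiated prices before drawing conclusions.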

The pattern is what builders should track. AMD's gap was infamous — capable hardware that nobody could deploy productively because the kernels weren't there, the schedulers weren't tuned, and framework support was patchy. The traditional answer would be "AMD fixes it" or "the open-source community fixes it" — both have been moving, but slowly. Neoclouds are a third path: vertically integrated providers who own both the software optimization *and* the deployment surface, capturing margin from the cost-per-token gap they create. That's structurally different from the Nvidia-plus-hyperscaler stack, where Nvidia owns the software and the hyperscalers run the hardware. AMD's path is fragmented by design, and that fragmentation is finally working in its favor: when no single platform owner controls the optimization story, specialized players can win on focused effort.

If you ship LLM inference at scale and got locked into H100/H200 because the AMD path looked too rough, the math has changed. Test the actual workload on MI300X via MangoBoost or MI355X via TensorWave/Crusoe before signing the next Nvidia procurement. The MLPerf number isn't the whole picture — your latency profile, kernel coverage for your specific model architecture, and your operations team's ROCm familiarity all matter — but ~2.8× inference throughput per dollar is a number that justifies a full benchmarking pass. The LLMBoost stack is the load-bearing software layer; if you're running plain vLLM on MI300X and getting unimpressive numbers, that's because plain vLLM isn't the optimized path. The signal isn't "AMD won." It's "the software lock-in argument for Nvidia is weaker than it was a year ago, and neoclouds are the reason."
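If you want a starting point for that benchmarking pass, here's a minimal offline-throughput harness using plain vLLM — a sketch, not a tuned setup: the model name, prompt mix, and parallelism degree are placeholders for your actual workload, and vendor stacks like LLMBoost have their own entry points.

```python
# Minimal offline throughput check with plain vLLM -- the unoptimized baseline
# this piece warns about. Swap in your real model and prompt distribution.
import time
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-chat-hf",  # placeholder; use your model
    tensor_parallel_size=8,                  # one 8-GPU node
)
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Placeholder request mix; replace with prompts sampled from production traffic.
prompts = ["Summarize the tradeoffs of liquid cooling in GPU datacenters."] * 512

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:,.0f} generated tokens/sec over {elapsed:.1f}s")
```

Whatever number this prints on MI300X is your baseline, not the ceiling; the gap between it and a tuned stack is exactly the margin the neoclouds are selling.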