Nvidia's real moat is CUDA — and the few engineers who can write kernels, Zubnet AI News

Wired's Sheon Han spent a day writing CUDA and emerged with a builder-relevant take: Nvidia's moat is not the H100 or B200 silicon, it is CUDA — the platform layer Ian Buck and John Nickolls began assembling at Nvidia in the mid-2000s, and the bundled libraries that have accreted on top of it ever since. A matrix multiplication that takes three lines in PyTorch took Han over fifty in CUDA. That ratio is the moat. PyTorch, TensorFlow and JAX are all CUDA-first; on AMD's MI300X — which on paper has more cores and more memory than an H100 — those same frameworks underperform because the kernels were tuned for Nvidia silicon, not because the hardware is slower. Independent benchmarks consistently bear that out.
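To make that line-count gap concrete, here is a minimal sketch in plain Python, not real CUDA and not Han's code, that mimics the bookkeeping a naive CUDA matrix-multiply kernel does by hand: sizing a grid of blocks, computing each thread's row and column from its indices, bounds-checking, and walking flat row-major buffers. The names `matmul_thread`, `launch_matmul` and `BLOCK` are hypothetical, chosen only for illustration.

```python
# Plain-Python emulation of a naive CUDA-style matmul launch (illustrative only).
# All names here (matmul_thread, launch_matmul, BLOCK) are hypothetical.

BLOCK = 2  # threads per block along each axis (an assumption; real kernels use larger tiles)


def matmul_thread(block_idx, thread_idx, a, b, c, m, n, k):
    """Work done by one 'thread': compute a single element of C = A @ B."""
    row = block_idx[1] * BLOCK + thread_idx[1]
    col = block_idx[0] * BLOCK + thread_idx[0]
    # Bounds check: the grid is rounded up, so some threads fall outside C.
    if row >= m or col >= n:
        return
    acc = 0.0
    for i in range(k):
        # Matrices are flat row-major buffers, as they would be in device memory.
        acc += a[row * k + i] * b[i * n + col]
    c[row * n + col] = acc


def launch_matmul(a, b, m, n, k):
    """Emulate the host side: allocate the output, size the grid, run every thread."""
    c = [0.0] * (m * n)
    grid_x = (n + BLOCK - 1) // BLOCK  # ceil-divide so the grid covers all columns
    grid_y = (m + BLOCK - 1) // BLOCK  # ceil-divide so the grid covers all rows
    for by in range(grid_y):
        for bx in range(grid_x):
            for ty in range(BLOCK):
                for tx in range(BLOCK):
                    matmul_thread((bx, by), (tx, ty), a, b, c, m, n, k)
    return c


# A is 2x3 and B is 3x2, both stored flat in row-major order.
a = [1.0, 2.0, 3.0,
     4.0, 5.0, 6.0]
b = [7.0, 8.0,
     9.0, 10.0,
     11.0, 12.0]
result = launch_matmul(a, b, m=2, n=2, k=3)
print(result)
```

In PyTorch the same computation is roughly a single `torch.matmul(a, b)` call on two tensors. A real CUDA version would additionally need device memory allocation, host-to-device copies and error handling, none of which this emulation attempts, which is where much of the extra length comes from.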

Beneath CUDA sits PTX, Nvidia's pseudo-assembly. DeepSeek's V3 training run famously dropped below the CUDA abstraction and wrote PTX directly to squeeze out throughput Nvidia's own libraries left on the table. That is the existence proof that the moat is drainable. The catch is that the global population of engineers who can do this work is small, and a meaningful share of them work at Nvidia. AMD's ROCm has shipped for years and its subreddit still reads like a support group. Intel's oneAPI is on life support. OpenCL — backed once by Apple, AMD and Qualcomm — never gained traction. The only credible challenger right now is Modular, Chris Lattner's company building Mojo and MAX, and Modular is still a long way from displacing PyTorch's CUDA dependency in production pipelines.

For the wrapper economy and the open-stack crowd this is the unglamorous reality: every "we run on AMD too" claim should be read as "we tolerate a performance gap, mostly invisible at inference, ugly at training." Frameworks like vLLM and SGLang are CUDA-tuned by default; AMD ports exist but lag. The deep-stack consequence is that any provider promising hardware-neutral inference is paying the CUDA tax in one of two ways: slower kernels on competing chips, or an engineering team trying to write its own PTX. That second option is what makes DeepSeek's R1 and V3 economics work; very few labs have the staff to repeat it. Even coding agents stumble on kernel code, which means the "AI writes its own kernels" path that would dissolve the moat is not yet operational.

For a Monday-morning builder: if your stack is Nvidia-only, the moat is paying for itself in performance you'd otherwise lose. If you are betting on AMD, Intel or a startup accelerator to break the lock-in, watch two signals: Modular's adoption inside actual training pipelines (not benchmarks), and whether OpenAI's Triton or a future PyTorch release from Meta abstracts enough of the kernel layer to make hardware swaps cheap. Until one of those shifts, the Han framing holds: Nvidia is a hardware company because it is a software company first, and the software layer is twenty years deep.
