
GPU

Also known as: Graphics Processing Unit
Originally designed for rendering graphics, GPUs turned out to be perfect for AI because they can do thousands of math operations simultaneously. Training and running AI models is essentially massive matrix multiplication — exactly what GPUs are built for. NVIDIA dominates this market.

Why it matters

GPUs are the physical bottleneck of the entire AI industry. Why models cost what they cost, why some providers are faster than others, why there's a global chip shortage — it all comes back to GPU supply and VRAM.

Deep Dive

The reason GPUs dominate AI isn't raw speed on any single calculation — a CPU actually handles individual operations faster. The advantage is parallelism. A modern CPU has 8-64 cores; an NVIDIA H100 has 16,896 CUDA cores. Neural networks are built on matrix multiplications, where you're doing the same operation on thousands of independent data points simultaneously. That's exactly the workload GPUs were designed for back when their job was calculating the color of millions of pixels every frame. The AI community just happened to notice that the same hardware architecture was perfect for training neural networks, and the modern GPU compute era was born.
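The independence of those per-element operations is easy to see in code: each cell of a matrix product depends only on one row of the left matrix and one column of the right one, so nothing stops hardware from computing all the cells at once. A minimal NumPy sketch (illustrative shapes, not tied to any real model):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((64, 32))
B = rng.standard_normal((32, 48))

# Each output cell is an independent dot product: C[i, j] depends only on
# row i of A and column j of B. Nothing forces these loops to run in order.
C_cellwise = np.empty((64, 48))
for i in range(64):
    for j in range(48):
        C_cellwise[i, j] = A[i, :] @ B[:, j]

# A GPU (or any vectorized library) exploits that independence and computes
# the same result with all cells in flight at once.
C_parallel = A @ B
assert np.allclose(C_cellwise, C_parallel)
```

The loop version and the vectorized version produce identical results; the only difference is how many of those 64 × 48 dot products happen simultaneously, which is exactly the axis on which a GPU wins.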

The CUDA Moat

NVIDIA's dominance in AI GPUs isn't just about hardware — it's about CUDA, the software ecosystem they've been building since 2006. CUDA is the programming framework that lets developers write code for NVIDIA GPUs, and virtually every major AI framework (PyTorch, TensorFlow, JAX) is built on top of it. AMD makes competitive hardware with their MI300X (192GB of HBM3 memory), and they've got ROCm as their CUDA alternative, but the ecosystem gap is enormous. Most AI researchers and engineers have spent years writing CUDA code and aren't eager to port it. Google's TPUs (Tensor Processing Units) are the other major player, but those are only available through Google Cloud — you can't buy one.

The Hardware Tiers

The GPU landscape has clear tiers. On the datacenter side, NVIDIA's H100 (80GB HBM3) has been the workhorse of AI training since 2023, with the H200 (141GB HBM3e) offering more memory for larger models. The B200 and GB200 represent the next generation. For inference specifically, the L40S (48GB GDDR6) offers a cheaper alternative when you don't need the raw training throughput. On the consumer side, the RTX 4090 with 24GB of GDDR6X is the king of local AI — enough VRAM to run quantized 14B-parameter models comfortably, though training anything serious on it is impractical. The gap between consumer and datacenter isn't just VRAM — it's memory bandwidth. An H100 pushes over 3 TB/s of memory bandwidth versus the 4090's roughly 1 TB/s, and for large language model inference, memory bandwidth is often the actual bottleneck.
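Why bandwidth is the bottleneck becomes obvious with a back-of-envelope estimate: for single-stream LLM inference, every weight must be read from memory once per generated token, so bandwidth divided by model size gives a hard ceiling on tokens per second. A rough sketch (the bandwidth figures are approximate public specs, and the one-read-per-token model ignores batching and KV-cache traffic):

```python
def max_tokens_per_sec(param_count, bytes_per_param, bandwidth_gb_s):
    """Ceiling on tokens/s if every weight is read once per token."""
    model_bytes = param_count * bytes_per_param
    return bandwidth_gb_s * 1e9 / model_bytes

# 14B-parameter model quantized to 4 bits (0.5 bytes/param).
h100 = max_tokens_per_sec(14e9, 0.5, 3350)     # H100: ~3.35 TB/s
rtx4090 = max_tokens_per_sec(14e9, 0.5, 1008)  # RTX 4090: ~1 TB/s

print(f"H100 ceiling:     ~{h100:.0f} tokens/s")
print(f"RTX 4090 ceiling: ~{rtx4090:.0f} tokens/s")
```

Note that raw compute never appears in the formula: both cards have far more FLOPs than this workload can use, which is why the ~3x bandwidth gap translates almost directly into a ~3x inference-speed gap.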

Scaling Beyond One Card

One thing practitioners learn quickly is that "having a GPU" and "having enough GPU" are very different situations. Running inference on a single model is one thing, but training a modern LLM requires multiple GPUs working together, connected by high-speed interconnects like NVLink or InfiniBand. An 8-GPU H100 node (DGX H100) costs around $300,000 and can fine-tune a 70B model — but pretraining at that scale needs far more, and frontier models like GPT-4 or Claude likely required thousands of GPUs for months. This is why cloud GPU rental (from providers like Lambda, DataCrunch, CoreWeave, or the hyperscalers) has become the standard approach: you rent a cluster for your training run and give it back when you're done, rather than buying hardware that will be outdated in two years.
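The arithmetic behind "thousands of GPUs for months" can be sketched with the widely used ~6 × parameters × tokens approximation for transformer training FLOPs. All the numbers below are illustrative assumptions (a hypothetical 70B model on 2T tokens, ~1e15 peak BF16 FLOP/s per H100, 40% utilization), not reported figures for any real model:

```python
def training_gpu_hours(params, tokens, peak_flops, utilization=0.4):
    """GPU-hours ~= 6 * N * D / (peak FLOP/s * utilization), in hours."""
    total_flops = 6 * params * tokens
    effective = peak_flops * utilization  # real runs rarely hit peak
    return total_flops / effective / 3600

# Hypothetical 70B-parameter model trained on 2 trillion tokens.
gpu_hours = training_gpu_hours(70e9, 2e12, 1e15)
days_on_8 = gpu_hours / 8 / 24        # one DGX H100 node
days_on_1024 = gpu_hours / 1024 / 24  # a 1,024-GPU cluster

print(f"~{gpu_hours:,.0f} H100-hours total")
print(f"8 GPUs: ~{days_on_8:,.0f} days; 1,024 GPUs: ~{days_on_1024:.0f} days")
```

Under these assumptions the run needs a few hundred thousand GPU-hours: years on a single node, but a few weeks on a 1,024-GPU cluster — which is exactly the economics that push training onto rented clusters.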
