
AI Infrastructure

Also known as: AI Infra, ML Infrastructure

The complete hardware, software, and services stack needed to train and deploy AI models at scale. It spans GPUs and custom chips, data centers, networking, storage, orchestration platforms (Kubernetes, Slurm), model-serving frameworks (vLLM, TensorRT), and the cloud providers that package it all together. AI infrastructure is where the abstract world of model architectures meets the very concrete world of power grids and cooling systems.

Why It Matters

Infrastructure determines what is possible. The reason only a handful of companies can train frontier models is not a shortage of ideas; it is a shortage of infrastructure. The reason AI costs end users so much traces directly back to GPU availability, data center capacity, and inference-serving efficiency.

Deep Dive

AI infrastructure looks nothing like traditional cloud computing, even though it runs inside the same data centers. A conventional web application is CPU-bound and memory-light — a few cores, a few gigabytes of RAM, maybe a modest database. AI workloads invert that profile entirely. Training a frontier model like GPT-4 or Claude requires thousands of GPUs running in parallel for weeks, connected by ultra-fast interconnects (InfiniBand or NVLink) so they can synchronize gradients without bottlenecking. The networking alone can cost more than the servers in a traditional setup. This is why companies like NVIDIA, with their DGX SuperPOD systems, and cloud providers like CoreWeave and Lambda Labs have built entire businesses around GPU-first infrastructure that would look absurd in any other context.
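The gradient synchronization that those fast interconnects exist to accelerate can be sketched in plain Python. This is a toy simulation, not a real distributed implementation: each "GPU" is just a list of gradient values, and the all-reduce collective is modeled as an element-wise average.

```python
# Toy model of the all-reduce step in data-parallel training. In a real
# cluster this runs over NVLink/InfiniBand via a library like NCCL; here
# each worker's gradients are simply a Python list of floats.

def all_reduce_mean(per_gpu_grads):
    """Average gradients element-wise across workers, as data-parallel
    training does after every backward pass."""
    num_gpus = len(per_gpu_grads)
    num_params = len(per_gpu_grads[0])
    averaged = [
        sum(grads[i] for grads in per_gpu_grads) / num_gpus
        for i in range(num_params)
    ]
    # After the collective, every worker holds the same averaged gradient.
    return [averaged[:] for _ in range(num_gpus)]

# Two workers, each holding gradients for a 2-parameter model.
grads = [[1.0, 2.0], [3.0, 6.0]]
synced = all_reduce_mean(grads)
print(synced[0])  # every rank now holds [2.0, 4.0]
```

The volume of this exchange is why networking dominates the bill: every worker must average every gradient with every other worker after every step, so the interconnect is on the critical path of the entire run.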

The Training Stack

Training infrastructure is dominated by a handful of hardware configurations. NVIDIA's H100 and H200 GPUs are the workhorses, typically deployed in clusters of 8 per node (connected via NVLink) with hundreds or thousands of nodes linked by InfiniBand networking. Google has its TPU pods (v5e and v6), Amazon has Trainium chips, and Microsoft has its custom Maia accelerator — but NVIDIA still commands roughly 80% of the AI training market. On the software side, distributed training frameworks like DeepSpeed, Megatron-LM, and PyTorch FSDP handle the parallelism strategies (data parallel, tensor parallel, pipeline parallel) that let a model too large for one GPU spread across an entire cluster. Orchestration typically runs on Kubernetes with GPU-aware scheduling, or Slurm for traditional HPC-style batch workloads. The entire stack — from silicon to scheduler — has to work in concert, and a single slow node or flaky network link can tank the performance of a thousand-GPU training run.
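A back-of-the-envelope sketch makes it concrete why a large model must be sharded with the parallelism strategies above. The formula is a deliberate simplification (fp16 weights only, ignoring activations, optimizer state, and KV caches), and all parameter counts are illustrative assumptions, not measurements of any specific model.

```python
# Estimate per-GPU weight memory under tensor and pipeline parallelism.
# Simplified: counts model weights only, at 2 bytes/param (fp16).

def per_gpu_weight_gb(params_billion, bytes_per_param=2,
                      tensor_parallel=1, pipeline_parallel=1):
    total_gb = params_billion * 1e9 * bytes_per_param / 1e9
    # Tensor parallelism splits each layer's matrices across GPUs;
    # pipeline parallelism assigns contiguous layers to different GPUs.
    return total_gb / (tensor_parallel * pipeline_parallel)

# A hypothetical 175B-parameter model is ~350 GB of fp16 weights alone,
# far beyond a single 80 GB H100:
print(per_gpu_weight_gb(175))  # 350.0
# Sharded 8-way tensor-parallel x 4-way pipeline-parallel across a
# 32-GPU group, it drops to roughly 10.9 GB of weights per GPU:
print(round(per_gpu_weight_gb(175, tensor_parallel=8, pipeline_parallel=4), 1))
```

Real frameworks like DeepSpeed and Megatron-LM combine all three strategies (data, tensor, pipeline) and handle the far harder parts this sketch omits: activation memory, communication scheduling, and recovery when a node fails mid-run.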

Inference Is a Different Beast

If training is a construction project, inference is a restaurant kitchen — it's about throughput, latency, and cost per request at scale. Inference infrastructure has its own specialized tools: vLLM and TensorRT-LLM for serving large language models with techniques like continuous batching and PagedAttention; Triton Inference Server for multi-model serving; and quantization tools that shrink models from 16-bit to 4-bit precision so they fit on cheaper hardware. The economics are stark: serving a model at full precision on H100s might cost $3 per million tokens, but running a quantized version on consumer GPUs or custom inference chips could bring that under $0.20. Companies like Groq (with their LPU chips), Cerebras (wafer-scale engines), and SambaNova (dataflow architecture) are all betting that purpose-built inference hardware will eventually undercut GPUs for serving.
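Much of that cost gap comes down to memory: quantization shrinks the weights so the same model fits on fewer, cheaper devices. The arithmetic below is a weights-only sketch with an assumed 70B-parameter model; the hardware fits in the comments are illustrative, not vendor specifications.

```python
# Weight memory of a model at different quantization levels.
# Weights only: KV cache and activations add more on top in practice.

def model_weight_gb(params_billion, bits_per_param):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    gb = model_weight_gb(70, bits)  # hypothetical 70B-parameter model
    print(f"{bits}-bit: {gb:.0f} GB")
# 16-bit: 140 GB -> needs two 80 GB H100s just to hold the weights
#  8-bit:  70 GB -> fits on a single H100
#  4-bit:  35 GB -> fits a 48 GB workstation GPU, or split across
#                   two 24 GB consumer cards
```

Halving the bits roughly halves the hardware, which is why 4-bit serving on commodity GPUs can undercut full-precision serving by an order of magnitude, at some cost in output quality.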

The Build-vs-Buy Decision

For most organizations, AI infrastructure is not something you build — it's something you rent. The hyperscalers (AWS, Azure, Google Cloud) offer GPU instances on demand, and specialized providers like CoreWeave, Lambda, and DataCrunch offer better GPU pricing with fewer extras. On-premise GPU clusters make sense only at massive scale: Meta operates over 600,000 H100s, and xAI's Memphis data center runs 100,000 GPUs under one roof. Below that scale, the operational overhead of managing GPU hardware — dealing with thermal throttling, GPU failures (H100s fail at roughly 1–3% per year), driver updates, and power management — rarely justifies the capital expense. The real infrastructure skill for most teams isn't building clusters; it's choosing the right provider, optimizing batch sizes, and knowing when to use a smaller model that runs on a single GPU instead of throwing hardware at the problem.
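The build-vs-buy trade-off can be framed as a break-even calculation. Every number below is an illustrative assumption, not a quote: hardware prices, cloud rates, and operating costs all vary widely by provider and contract.

```python
# Break-even sketch: how many months of use until owning a GPU beats
# renting one. All inputs are assumed, illustrative figures.

def breakeven_months(capex_per_gpu, monthly_opex_per_gpu,
                     cloud_rate_per_hour, utilization=0.7,
                     hours_per_month=730):
    """Months until cumulative rental cost exceeds ownership cost."""
    monthly_rent = cloud_rate_per_hour * hours_per_month * utilization
    monthly_saving = monthly_rent - monthly_opex_per_gpu
    if monthly_saving <= 0:
        # Power, cooling, and ops eat the savings: renting always wins.
        return float("inf")
    return capex_per_gpu / monthly_saving

# Assumed: $30k capex per GPU, $500/month power+ops, $2.50/hr cloud
# rate, 70% sustained utilization.
print(round(breakeven_months(30_000, 500, 2.50), 1))  # 38.6
```

Under these assumptions ownership only pays off after more than three years of sustained high utilization, which is why on-premise clusters tend to make sense only for operators running at hyperscale.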

Where It's Heading

The infrastructure landscape is shifting fast. Custom silicon is proliferating — every major cloud provider now has or is building its own AI chips, chasing NVIDIA's margins. Inference-optimized hardware is separating from training hardware, because the workload profiles are so different. Edge inference is growing, with models running on phones (Apple's Neural Engine, Qualcomm's Hexagon) and laptops (Intel's NPU, AMD's XDNA) rather than in the cloud. And the rise of AI agents — systems that make multiple model calls per task — is multiplying inference demand in ways that are straining current capacity. The companies that control AI infrastructure today control the pace of AI progress, which is exactly why Microsoft, Google, and Amazon are each spending over $50 billion per year on data centers.
