Groq

Groq LPU
A chip company building custom AI inference processors called LPUs (Language Processing Units). Unlike NVIDIA GPUs, which are general-purpose parallel processors adapted for AI, Groq's LPUs are purpose-built for the sequential token generation that LLM inference requires. The result: extremely fast inference speeds, often 10x faster than GPU-based alternatives for LLM generation.

Why it matters

Groq demonstrated that LLM inference doesn't have to be slow. Their cloud API serves open models (Llama, Mixtral) at speeds of 500–800 tokens per second — fast enough that responses appear nearly instantly. This speed advantage comes from hardware architecture, not software optimization, suggesting that the current GPU-centric approach to AI inference may not be the long-term winner.
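To make the speed claim concrete, a back-of-envelope latency calculation shows why 500–800 tokens per second feels instant. The function below is a minimal illustrative sketch (the names and the 700 and 60 tokens-per-second figures are assumptions for comparison, not benchmarks):

```python
def response_latency(tokens: int, tokens_per_second: float) -> float:
    """Time (in seconds) to generate a response at a given decode rate."""
    return tokens / tokens_per_second

# A 300-token answer at a Groq-like rate vs. a slower GPU endpoint
# (illustrative numbers only):
fast = response_latency(300, 700)   # ~0.43 s: reads as instant
slow = response_latency(300, 60)    # 5.0 s: a visible wait
```

At hundreds of tokens per second, generation time drops below typical network latency, which is why the response appears all at once rather than streaming word by word.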

Deep Dive

The LPU (Language Processing Unit) is built around a deterministic execution model. Unlike GPUs, which schedule work dynamically and suffer from memory bandwidth bottlenecks, LPUs have a fixed dataflow architecture where computation and data movement are orchestrated at compile time. This eliminates scheduling overhead and allows the chip to sustain near-peak throughput for the sequential, memory-bound operations that dominate LLM inference (especially token generation, which is limited by how fast you can read model weights from memory).
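The "limited by how fast you can read model weights" point can be quantified with a simple roofline-style bound: with batch size 1, generating each token requires streaming the full weight set from memory once, so single-stream decode speed cannot exceed memory bandwidth divided by model size. A sketch, with illustrative (assumed) numbers:

```python
def decode_rate_ceiling(params_billion: float, bytes_per_param: float,
                        bandwidth_gb_s: float) -> float:
    """Upper bound on single-stream tokens/sec for memory-bound decoding:
    each generated token requires one full read of the model weights."""
    model_gb = params_billion * bytes_per_param
    return bandwidth_gb_s / model_gb

# Illustrative: a 70B-parameter model in FP16 (140 GB of weights) on
# hardware with 3,000 GB/s of memory bandwidth:
ceiling = decode_rate_ceiling(70, 2, 3000)   # ~21 tokens/s per stream
```

This is why the bottleneck is architectural: raising single-stream speed requires more aggregate memory bandwidth (or smaller/quantized weights), not more raw compute, and Groq attacks exactly that term by keeping weights in fast on-chip SRAM spread across many LPUs.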

The Trade-offs

Groq's speed advantage comes with constraints. The deterministic architecture works best for models that fit a known execution pattern, namely standard Transformer inference; custom architectures, training workloads, and highly dynamic computation graphs are harder to map to the LPU. Groq is also an inference-only solution: you still need GPUs (or TPUs) for training. And the cost per token, while decreasing, isn't always lower than GPU inference for high-throughput batch workloads, where a GPU can amortize each weight read across many requests in a batch.
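The batch-amortization point works like this: when many requests are decoded together, one pass over the weights produces one token for every sequence in the batch, so aggregate throughput grows with batch size until compute becomes the limit. A minimal sketch with assumed, illustrative rates:

```python
def batched_throughput(single_stream_rate: float, batch_size: int,
                       compute_bound_rate: float) -> float:
    """Aggregate tokens/sec when a batch shares each weight read.
    Throughput scales linearly with batch size until the chip's
    compute ceiling caps it."""
    return min(single_stream_rate * batch_size, compute_bound_rate)

# Illustrative: 20 tok/s per stream with a 10,000 tok/s compute ceiling.
# A batch of 64 yields 1,280 tok/s aggregate; a batch of 1,024 would
# exceed the ceiling and saturates at 10,000 tok/s.
```

This is why GPUs can win on cost per token for offline or high-volume serving even while losing badly on single-stream latency: the two metrics reward different points in the design space.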
