The WSE-3 has 44 GB of on-chip SRAM — not HBM or DRAM, but SRAM directly on the compute die. This provides ~21 PB/s of memory bandwidth, roughly three orders of magnitude more than the HBM bandwidth of a modern GPU. For memory-bandwidth-bound operations (like LLM inference, where decode speed is limited by how fast you can read the model weights), this is a fundamental advantage. The trade-off: 44 GB of on-chip memory can't hold the largest models — a 70B-parameter model in FP16 alone needs ~140 GB — so the largest models require model-parallel strategies across multiple CS-3 systems.
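The bandwidth argument can be made concrete with a back-of-envelope bound: during single-stream decoding, every generated token requires reading all model weights once, so memory bandwidth divided by model size in bytes caps the token rate. The sketch below is illustrative arithmetic under stated assumptions (batch size 1, FP16 weights, KV-cache traffic ignored), not a vendor benchmark; the ~3 TB/s HBM figure is an assumed GPU-class number.

```python
def max_tokens_per_sec(params_billion: float, bytes_per_param: float,
                       mem_bandwidth_tb_s: float) -> float:
    """Upper bound on single-stream decode speed for a
    memory-bandwidth-bound workload: each token must stream
    every weight from memory once."""
    bytes_per_token = params_billion * 1e9 * bytes_per_param
    return mem_bandwidth_tb_s * 1e12 / bytes_per_token

# A 70B-parameter model in FP16 (2 bytes/param):
# GPU-class HBM at an assumed ~3 TB/s vs. wafer-scale SRAM
# at ~21,000 TB/s (21 PB/s).
hbm = max_tokens_per_sec(70, 2, 3)        # ~21 tok/s ceiling
sram = max_tokens_per_sec(70, 2, 21_000)  # ~150,000 tok/s ceiling
print(f"HBM-bound ceiling:  {hbm:,.0f} tok/s")
print(f"SRAM-bound ceiling: {sram:,.0f} tok/s")
```

These are ceilings, not achieved throughput — real systems fall well short due to compute, interconnect, and scheduling overheads — but they show why the memory hierarchy, not FLOPs, dominates this workload.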
Cerebras has demonstrated impressive inference speeds — serving Llama-70B at over 2,000 tokens/second, competitive with or exceeding Groq's LPU. The approach is different (a single wafer-scale chip vs. racks of deterministic ASICs) but the result is similar: purpose-built hardware that dramatically outperforms GPUs on the specific workload of LLM token generation.