Google split its eighth-generation TPU into training (TPU 8t) and inference (TPU 8i) chips at Cloud Next 2026: 9,600-chip training pods with native FP4, and 1,152-chip inference pods with 3× the SRAM of Ironwood

Google unveiled its eighth-generation TPUs at Cloud Next 2026 with the architectural shift that had been rumored for a year: splitting training and inference workloads onto separate chips. TPU 8t is for training, TPU 8i for inference. Each is optimized for the specific bottlenecks of its half of the AI workload: training wants raw throughput and interconnect bandwidth across huge pods, while inference wants low latency and memory-access locality for autoregressive decoding.

TPU 8t pods hold 9,600 chips, up from Ironwood's 9,216, connected through a 3D torus network. The architectural additions are SparseCore (acceleration for the sparse ops that dominate MoE models) and native four-bit floating point (FP4), which reduces memory bandwidth pressure and raises effective throughput per byte of memory. Google claims 2.7× the performance-per-dollar of Ironwood for large-scale training and 2× the performance-per-watt of the previous generation. Detailed FLOPS numbers and HBM specs are not yet public.
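To make the FP4 point concrete, here is a rough back-of-the-envelope sketch of how bits per value translate into bytes of weights that have to move through memory. The 70-billion-parameter model size and the helper name `weight_bytes` are hypothetical, chosen only for illustration; the arithmetic also ignores the per-block scale factors that real low-precision formats usually carry, so treat the numbers as an approximation rather than a TPU 8t specification.

```python
def weight_bytes(num_params: int, bits_per_value: int) -> float:
    """Bytes needed to store num_params values at the given bit width.

    Simplification: ignores scale factors and other metadata that
    block-scaled low-precision formats typically add.
    """
    return num_params * bits_per_value / 8


# Hypothetical model size, not taken from the article.
PARAMS = 70_000_000_000

for name, bits in [("BF16", 16), ("FP8", 8), ("FP4", 4)]:
    gigabytes = weight_bytes(PARAMS, bits) / 1e9
    print(f"{name}: {gigabytes:.0f} GB of weights")
```

Under these assumptions, going from 16-bit to 4-bit values cuts the bytes per weight by a factor of four, which is the sense in which a narrower format can raise effective throughput per byte of memory traffic.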

TPU 8i is the more interesting architectural move. Pod size tops out at 1,152 chips, using a new interconnect topology called Boardfly ICI. The chip has three times the SRAM of Ironwood. That design choice is about keeping the KV cache and activations on-chip for low-latency autoregressive decoding. A Collectives Acceleration Engine is dedicated to the all-reduce and all-to-all patterns that dominate inference, and Boardfly cuts the hops needed for all-to-all communication by up to 50%. Google's claim for the inference chip: 80% better perf-per-dollar than Ironwood at low-latency targets, and 2× the perf-per-watt of the previous generation.
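Why on-chip memory matters for decoding can be seen from a rough estimate of KV cache size. The sketch below uses the standard accounting of one key and one value tensor per layer; the model shape (32 layers, 8 KV heads, head dimension 128) and the function name `kv_cache_bytes` are hypothetical and are not tied to any TPU 8i figure, since Google has not published SRAM capacities.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # assumes 16-bit cache entries
) -> int:
    """Approximate KV cache size for a decoder-only transformer."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Hypothetical model shape, chosen only for illustration.
per_token = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=1)
full_context = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=8192)

print(f"per token: {per_token / 1024:.0f} KiB")
print(f"8192-token context: {full_context / 2**30:.2f} GiB")
```

Every decode step reads the whole cache for the sequence so far, so the closer that data sits to the compute units, the lower the per-token latency.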

Two things are worth registering for builders. One, the training-versus-inference split at the silicon level is hardware recognition of what every LLM serving paper has been saying for two years: prefill and decode, training and serving, have different compute and memory profiles and benefit from different silicon. Anthropic's Amazon Trainium deal (1 million-plus chips deployed, 5 gigawatts over the decade) shows the same logic on Amazon silicon. Now Google is splitting the same way. Two, the multi-billion-dollar deal that Thinking Machines Lab signed with Google Cloud the same week, for NVIDIA GB300 chips, is a consistent signal: Google sells both its own silicon and NVIDIA's through the same cloud, because customers want options. Custom silicon is winning on margin, but not yet on exclusivity.
