LightSeek TokenSpeed: 580 tps Qwen3.5-397B-A17B on B200, MIT-licensed OSS

LightSeek Foundation released TokenSpeed, an open-source inference engine under the MIT license. It reports 580 tok/s single-user throughput on Qwen3.5-397B-A17B with NVFP4 quantization, running 8-way tensor-parallel on NVIDIA B200. The benchmarked agentic workload is the right shape: 50K tokens of first-turn context, then 10-15 turns of 800 tokens each, at a >90% KV cache hit rate. The positioning is "TensorRT-LLM performance with vLLM usability": the engine is built from scratch with an SPMD architecture and static compilation.
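To see how a workload of that shape reaches a cache hit rate around 90%, here is a back-of-the-envelope sketch. It is not TokenSpeed's accounting: the function name and the model are assumptions, namely that every turn re-sends the full history, each turn appends 800 tokens, and each prompt reuses the entire previous prompt as a cached prefix.

```python
def prefix_cache_hit_rate(first_turn_tokens, tokens_per_turn, num_turns,
                          include_cold_turn=True):
    """Fraction of prompt tokens served from the prefix cache over a session.

    Hypothetical model: the first turn is a cold miss, and every later turn
    finds the whole previous prompt in the cache.
    """
    total_prompt = 0
    total_cached = 0
    context = first_turn_tokens
    cached = 0  # nothing is cached before the first turn
    for turn in range(num_turns):
        if turn > 0 or include_cold_turn:
            total_prompt += context
            total_cached += cached
        cached = context            # this prompt becomes the next cached prefix
        context += tokens_per_turn  # the next turn appends new tokens
    return total_cached / total_prompt


for turns in (10, 12, 15):
    rate = prefix_cache_hit_rate(50_000, 800, turns)
    print(f"{turns} turns: {rate:.1%} of prompt tokens hit the cache")
```

Under these assumptions the session-level rate is about 89% at 10 turns, passes 90% around 12 turns, and is about 93% at 15 turns. Passing `include_cold_turn=False` counts only the warm turns, which puts the rate above 98%, so the reported figure depends on how the cold first turn is counted.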

Three optimization categories carry the speed. Memory copy elimination combines three techniques: hybrid prefix caching across KV pages and Mamba state slots (Qwen3.5's linear-attention layers maintain recurrent state, which TokenSpeed checkpoints alongside KV); index indirection via current_input_indices instead of tensor copies during speculative decoding; and copy-on-write semantics so cached checkpoints are reused without mutation. Kernel fusions collapse multi-stage ops: GemmaRMSNorm AllReduce goes from 3 kernels to 1, QK-RMSNorm + Partial RoPE + Gate Split from 5 kernels to 1 Triton kernel with intermediates staying in registers, and MoE Gate-Sigmoid-Mul-Add from 5 kernels to 1. Overlapped CPU-GPU execution uses CUDA graph capture, async H2D copies with pinned memory, event-based layer barriers, and GPU-side sentinels to eliminate D2H round-trips. The long-context curve is the number worth noting: ~530 tok/s at 128K, ~495 tok/s at 256K, and ~445 tok/s at 1M, a 16% degradation from 128K to 1M, which is an 8× context expansion.
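The copy-on-write idea can be illustrated with a toy paged cache. This is a minimal sketch, not TokenSpeed's code: `Page`, `Sequence`, `fork`, and the page size of 4 are all hypothetical, and plain Python lists stand in for KV pages or recurrent-state slots.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A fixed-size block of cached per-token state, shared via a refcount."""
    tokens: list
    refcount: int = 1


@dataclass
class Sequence:
    """One request's view of the cache: an ordered list of page references."""
    pages: list = field(default_factory=list)

    def fork(self):
        """Reuse a cached prefix by sharing its pages instead of copying them."""
        for page in self.pages:
            page.refcount += 1
        return Sequence(pages=list(self.pages))

    def append(self, token, page_size=4):
        """Append one token, copying the last page only if it is shared."""
        if not self.pages or len(self.pages[-1].tokens) == page_size:
            self.pages.append(Page(tokens=[]))
        last = self.pages[-1]
        if last.refcount > 1:
            # Copy-on-write: leave the shared page untouched for other readers.
            last.refcount -= 1
            last = Page(tokens=list(last.tokens))
            self.pages[-1] = last
        last.tokens.append(token)


# A cached checkpoint of six tokens, then a new turn that extends it.
checkpoint = Sequence()
for t in range(6):
    checkpoint.append(t)
turn2 = checkpoint.fork()
turn2.append(99)
```

After the fork, the full first page is shared by reference and only the partially filled last page is copied when `turn2` writes to it, so the checkpoint stays intact for later reuse.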

The ecosystem read for builders is two-fold. First, agentic-workload-shaped inference is becoming a distinct category from generic prompt completion. The optimizations TokenSpeed shipped — prefix-cache-aware design, multi-turn KV reuse, Mamba/GDN state caching — are tuned for the regime where the same context grows across turns, which is exactly the regime LLM agents live in. Single-batch numbers are the cleanest signal for this workload because real agent traces are usually serial per-user. Second, the methodology gap is real: no head-to-head numbers against vLLM, SGLang, or TensorRT-LLM on the same Qwen3.5 NVFP4 setup are published, which means the "580 tps record" framing needs reproduction by independent runners. The MIT license and public GitHub at lightseekorg/tokenspeed enable that reproduction, which is the methodological win regardless of whether the headline holds.
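For an independent reproduction, the core measurement is small enough to sketch. The helper below is an assumption-laden illustration, not part of TokenSpeed or any engine's API: `measure_decode_tps` is a hypothetical name, and `token_stream` stands for any iterator that yields one item per generated token, such as a streaming client response.

```python
import time


def measure_decode_tps(token_stream, clock=time.perf_counter):
    """Return (time_to_first_token_s, decode_tokens_per_s) for one request.

    Decode throughput is measured from the first token to the last, so
    prefill time is reported separately instead of diluting the tok/s figure.
    """
    start = clock()
    first_token_at = last_token_at = None
    count = 0
    for _ in token_stream:
        last_token_at = clock()
        if first_token_at is None:
            first_token_at = last_token_at
        count += 1
    if count < 2:
        raise ValueError("need at least two tokens to measure decode throughput")
    ttft = first_token_at - start
    decode_tps = (count - 1) / (last_token_at - first_token_at)
    return ttft, decode_tps


def fake_stream(num_tokens, delay_s):
    """Stand-in for a real streaming response (hypothetical, for illustration)."""
    for i in range(num_tokens):
        time.sleep(delay_s)
        yield i


ttft, tps = measure_decode_tps(fake_stream(20, 0.001))
print(f"TTFT {ttft * 1000:.1f} ms, decode {tps:.0f} tok/s")
```

Running the same harness against each engine with identical prompts, turn structure, quantization, and hardware is what would turn a single reported number into a comparison.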

If you run agentic inference on hybrid-architecture models, the Monday-morning action is a TokenSpeed reproduction run on your specific workload, particularly if you have a B200 cluster and NVFP4-aware tooling. If you build inference SaaS: the agentic-workload optimization category (prefix caching that survives multi-turn state churn) is now visibly separate from batch-prompt throughput. The engines that win agent serving will not be the same ones that win throughput benchmarks.
