Zyphra has released TSP — Tensor + Sequence Parallelism — a parallelism strategy that collapses what used to be two orthogonal axes (TP sharding weights, SP sharding activations) onto a single device-mesh axis. The architectural choice that matters: each GPU holds 1/D of the model weights *and* 1/D of the sequence tokens, where D is the axis size. Both parameter memory and activation memory drop by the same 1/D factor on the same hardware. The validated configuration: 7B dense decoder-only transformer (h=4096, 32 layers, 32 Q/KV heads, FFN×4, bf16) on 1,024 AMD MI300X GPUs at 128K sequence length with D=8. Reported throughput: 173M tokens/sec versus 66.3M for the matched TP+SP baseline — a 2.6× improvement.
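To make the 1/D claim concrete, here's a back-of-envelope sketch of how both memory terms scale when weights and tokens are sharded along the same axis. The accounting is ours, not the paper's: it ignores gradients, optimizer states, attention K/V, and checkpointing choices, so it won't reproduce the 38.8 GB/GPU figure below, but it shows both terms carrying the same 1/D factor.

```python
# Rough per-GPU memory model for a folded TSP axis of size D: weights and
# activations both divide by D. This is a toy estimate, not the paper's
# accounting -- gradients, optimizer states, K/V, and checkpointing are
# ignored, so it will not match the reported 38.8 GB/GPU.

BYTES_BF16 = 2

def per_gpu_memory_gb(n_params, seq_len, hidden, n_layers, axis_size):
    """Very rough per-GPU residency when weights and tokens are both
    sharded 1/axis_size along the same device-mesh axis."""
    weight_bytes = n_params * BYTES_BF16 / axis_size
    # crude activation proxy: one retained hidden-state tensor per layer
    activation_bytes = seq_len * hidden * n_layers * BYTES_BF16 / axis_size
    return (weight_bytes + activation_bytes) / 1e9

# The validated config from the post: 7B dense, h=4096, 32 layers,
# 128K tokens, D=8 -> roughly 6 GB under this toy model.
print(per_gpu_memory_gb(7e9, 128 * 1024, 4096, 32, 8))
```

The structure is the point: both terms divide by the axis size, which is exactly the fold TSP is exploiting.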

The communication strategy is where the engineering substance lives. For attention: weight shards are broadcast iteratively, each GPU applies them to its local tokens, then K/V tensors are all-gathered using a zigzag partition for load balancing. For MLP: a ring schedule circulates weight shards via point-to-point operations, *eliminating* the all-reduce that standard TP requires. The single-node memory comparison at 128K tokens (8× MI300X): 38.8 GB/GPU under TSP versus 70.0 GB under plain TP and 85-140 GB under various TP+SP variants. That memory headroom is what unlocks longer-context training/inference at this dense model size on this hardware. Paper at arxiv.org/pdf/2604.26294; technical writeup at zyphra.com/post/tsp.
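Two of those mechanisms are easy to sketch in Python. First, the zigzag partition. One common way to balance a causal sequence across D ranks is to split it into 2D chunks and pair chunk r with chunk 2D-1-r, so each rank owns one cheap early span and one expensive late span under the causal mask; this is the generic balancing idea, and Zyphra's exact layout may differ.

```python
# A common zigzag layout for a causal sequence: split into 2*D chunks and
# give rank r chunks [r, 2*D - 1 - r], pairing a cheap early span with an
# expensive late span. Generic balancing idea only; the paper's exact
# partition may differ.

def zigzag_chunk_ids(num_ranks: int) -> list[list[int]]:
    return [[r, 2 * num_ranks - 1 - r] for r in range(num_ranks)]

print(zigzag_chunk_ids(4))  # [[0, 7], [1, 6], [2, 5], [3, 4]]
```

Second, the ring-scheduled MLP. The sketch below is a single-process simulation of the dataflow, assuming the first projection is column-sharded and the second row-sharded along the FFN dimension: weight shards visit each rank in turn, each rank applies them to its local tokens, and the partial outputs sum to the exact full-MLP result with no all-reduce over activations. Overlap of point-to-point transfers with compute, kernels, and the actual collectives are in the paper and aren't modeled here.

```python
# Single-process simulation of a ring schedule for the MLP: each "rank" keeps
# its 1/D slice of the tokens, weight shards travel around the ring, and every
# rank ends up with the exact full-MLP output for its local tokens -- no
# all-reduce over activations. Dataflow only; this is a sketch, not Zyphra's
# implementation.

import torch

D, seq, h, ffn = 4, 64, 32, 128           # toy sizes, not the 7B config
x = torch.randn(seq, h)

# Reference: unsharded two-layer MLP (no bias or activation fn, for brevity).
w1 = torch.randn(h, ffn)                   # column-sharded across ranks
w2 = torch.randn(ffn, h)                   # matching rows sharded across ranks
ref = x @ w1 @ w2

# Shard tokens and weights 1/D along the same axis.
x_local = list(x.chunk(D, dim=0))          # x_local[r]: tokens owned by rank r
w1_shard = list(w1.chunk(D, dim=1))        # column slices of w1
w2_shard = list(w2.chunk(D, dim=0))        # matching row slices of w2

# Ring schedule: at step s, rank r applies the shard originally owned by
# rank (r + s) % D to its local tokens and accumulates the partial output.
out_local = [torch.zeros(x_local[r].shape[0], h) for r in range(D)]
for step in range(D):
    for r in range(D):
        src = (r + step) % D               # shard currently "visiting" rank r
        out_local[r] += (x_local[r] @ w1_shard[src]) @ w2_shard[src]

assert torch.allclose(torch.cat(out_local, dim=0), ref, atol=1e-3)
```

The design point the simulation makes: what travels around the ring is weight shards, whose size is independent of sequence length, instead of the activation all-reduce of standard TP, whose volume grows with the local token count.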

Two ecosystem signals. First, the result was validated on 1,024 MI300X — not on H100 — which is consistent with the broader neocloud story: AMD's silicon is showing up in production-class research clusters when the software stack is good enough, and Zyphra's apparently is. Second, the architectural choice (shard both weights and activations on the same axis rather than orthogonal axes) is the kind of simplification that opens new design space for parallelism. PTD-P (Megatron-LM) and FSDP have been the default playbooks for years; TSP doesn't replace them, but it widens the set of hardware/model combinations where folded sharding can beat orthogonal sharding. If you've been running TP+SP at small-to-medium model scale on AMD or NVIDIA, TSP is worth a benchmark pass on your specific config.

For builders training or serving large models, the take-home is concrete. A per-GPU footprint of 38.8 GB versus 70-140 GB at 128K context means the freed memory can go toward longer contexts on the same hardware or larger models on the same budget. The 2.6× throughput claim is config-specific (1,024 MI300X, 7B dense, D=8); at smaller scales or on H100/H200 the numbers will differ, so read the paper and run it on your shape. The MLP-without-all-reduce trick is portable: even if you don't adopt TSP wholesale, eliminating that all-reduce in your existing TP setup is the kind of win worth pulling out as a standalone optimization. Zyphra hasn't released code as of this writeup; that's the next thing to watch for.