TLX Block Attention on B200: 2.31x vs FA v2, +30.6% MFU, ads-models only

PyTorch and Meta's ads-ranking team dropped a Blackwell-specific attention kernel this week — TLX Block Attention — and the bigger story is the layer below it. **TLX (Triton Language Extensions)** is the DSL bridge between Triton's productivity and raw CUTLASS-level control on Blackwell, exposing the new tcgen05 async tensor cores, TMA descriptors and TMEM (256KB-per-SM Tensor Memory) as Triton primitives like `tlx.async_dot`, `tlx.async_descriptor_load`, `tlx.local_trans`, plus mBarrier synchronization for producer-consumer warp pipelines. Repo: github.com/triton-lang/triton-ext. This is the layer most builders writing Blackwell kernels in 2026 will live in once it stabilises.

The kernel itself targets fixed-block sparse self-attention: 64-token blocks in a block-diagonal pattern that is known at compile time. That shape is specific to Meta's ads-ranking and recommendation models, not LLM attention. Because the pattern is fixed, the kernel eliminates Flash Attention's multi-tile iteration loop, online-softmax correction factors, logsumexp HBM round-trip, and separate Di preprocessing: each Q tile attends to exactly one K/V tile, so a single GEMM covers it and no correction is needed. The forward pass uses 15 warps per CTA in a specialized pipeline (1 load / 1 QK-MMA / 4 softmax / 1 PV-MMA / 8 epilogue); the backward pass uses 20 warps across 7 stages. TMEM is triple-buffered in forward (~169KB of 256KB) and double-buffered in backward (~162KB of 256KB). On B200 with BF16 at 70% sparsity, forward takes 0.98ms vs Flash Attention v2's 1.81ms (1.85×) and backward takes 2.36ms vs 5.89ms (2.50×), for 2.31× combined. Numerical accuracy is also better: the max dQ difference is 53% smaller than FA v2's.
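To make the "one Q tile, one K/V tile" point concrete, here is a minimal pure-Python reference sketch of block-diagonal attention. It is not the TLX kernel and says nothing about warp pipelines or TMEM; the function names and the list-of-rows layout are assumptions chosen for illustration. The thing to notice is that the softmax sees the whole score row at once, so there is no running max or rescaling step of the kind a multi-tile loop needs.

```python
import math

BLOCK = 64  # block size named in the article; everything else is illustrative


def softmax(row):
    """Single-pass softmax over one complete score row (max-subtracted)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def block_diagonal_attention(q, k, v, block=BLOCK):
    """Reference block-diagonal attention.

    q, k, v are lists of rows (seq_len x head_dim); seq_len must be a
    multiple of `block`. Each query row only sees the keys/values of its
    own block, so one score tile per block is enough.
    """
    seq_len, head_dim = len(q), len(q[0])
    assert seq_len % block == 0, "sequence length must be a multiple of block"
    scale = 1.0 / math.sqrt(head_dim)
    out = []
    for start in range(0, seq_len, block):
        k_tile = k[start:start + block]  # the only K tile this Q tile needs
        v_tile = v[start:start + block]  # the only V tile this Q tile needs
        for q_row in q[start:start + block]:
            scores = [scale * sum(a * b for a, b in zip(q_row, k_row))
                      for k_row in k_tile]
            probs = softmax(scores)  # full row is available: no online correction
            out.append([sum(p * v_row[d] for p, v_row in zip(probs, v_tile))
                        for d in range(head_dim)])
    return out
```

Calling `block_diagonal_attention(q, k, v, block=2)` on a 4-token toy input is an easy way to check that rows in the first block are unaffected by keys and values in the second.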

The fused rotary backward is the second highlight and the generalizable pattern. Unfused, standalone attention backward (1.56ms) plus rotary backward (4.88ms) costs 6.44ms. Fused into a single kernel that keeps dV in FP32 within TMEM/registers, applies the rotary conjugate in place, and then does one BF16 global store, it takes 1.82ms. **3.54× faster.** The lesson is portable beyond ads workloads: when you have FP32 intermediate values in registers/TMEM, doing your epilogue math at FP32 and storing BF16 once eliminates the global-memory round-trips that otherwise dominate. That's the kind of insight builders can apply to other fused-op kernels even without TLX or Blackwell.
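The "epilogue at FP32, store BF16 once" idea can be sketched on the CPU. The snippet below is an illustration only, not the kernel's code: it emulates BF16 rounding with `struct`, uses Python floats as a stand-in for the FP32 values held on chip, and applies the standard inverse rotation to one gradient pair. The function names and the bytes-per-element traffic model are assumptions made for this example.

```python
import math
import struct

BF16_BYTES = 2
# Simplified global-memory traffic for the gradient tensor only (assumed model):
# unfused = store (attention bwd) + load + store (rotary bwd); fused = one store.
UNFUSED_BYTES_PER_ELEM = 3 * BF16_BYTES
FUSED_BYTES_PER_ELEM = 1 * BF16_BYTES


def to_bf16(x):
    """Round a finite float to the nearest BF16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round on the 16 bits being dropped
    bits &= 0xFFFF0000                   # keep sign, exponent, 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rotary_backward(g0, g1, theta):
    """Gradient through a 2D rotation by theta: rotate the pair by -theta."""
    c, s = math.cos(theta), math.sin(theta)
    return g0 * c + g1 * s, -g0 * s + g1 * c


def unfused(g0, g1, theta):
    """Two kernels: BF16 store, reload, rotate, BF16 store (two roundings)."""
    h0, h1 = to_bf16(g0), to_bf16(g1)
    r0, r1 = rotary_backward(h0, h1, theta)
    return to_bf16(r0), to_bf16(r1)


def fused(g0, g1, theta):
    """One kernel: rotate at full precision, then a single BF16 store."""
    r0, r1 = rotary_backward(g0, g1, theta)
    return to_bf16(r0), to_bf16(r1)


if __name__ == "__main__":
    exact = rotary_backward(0.3, -1.7, 0.9)
    print("exact  ", exact)
    print("fused  ", fused(0.3, -1.7, 0.9))
    print("unfused", unfused(0.3, -1.7, 0.9))
    print("bytes/elem", FUSED_BYTES_PER_ELEM, "vs", UNFUSED_BYTES_PER_ELEM)
```

The fused path rounds once and touches global memory once per element; the unfused path rounds twice and moves three times the bytes under this simplified model. Whether that translates into a speedup like the one reported depends on how memory-bound the real kernels are.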

Monday morning: this kernel as-shipped is useful to you if you ship ad-ranking, recsys or feature-interaction models with block-diagonal attention on B200/B300 GPUs. If so, clone facebookresearch/ads_model_kernel_library and benchmark. If you're an LLM builder, the kernel doesn't apply (causal, sliding-window, and arbitrary sparse patterns are explicitly excluded), but the TLX DSL itself is the part to watch: it's how Blackwell-aware Triton kernels are going to be written, and most of the architectural primitives (warp specialization, TMA descriptors, TMEM accumulators) generalize to whatever attention shape your stack needs. Honest limits: Blackwell-only (sm_100+), no Ampere/Hopper fallback, head_dim hardcoded to 64 or 128, block size 64 fixed, license not stated in the blog (check the repo). For LLM-shaped attention with these techniques, the Blackwell port of Flash Attention 3 and its successors will be the watch item over the next quarter.
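The limits listed above can be turned into a quick pre-flight check before you spend time on benchmarking. This is a hypothetical helper, not part of the kernel library: `AttentionConfig`, `unsupported_reasons`, and the pattern strings are names invented here, and the thresholds are copied from the limits as stated in this write-up, so verify them against the repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttentionConfig:
    """Hypothetical description of an attention workload (not a library type)."""
    compute_capability: tuple  # (major, minor), e.g. (10, 0) for sm_100
    head_dim: int
    block_size: int
    pattern: str               # e.g. "block_diagonal", "causal", "sliding_window"


def unsupported_reasons(cfg):
    """Return the stated limits this config violates; empty list means eligible."""
    reasons = []
    if cfg.compute_capability < (10, 0):
        reasons.append("needs Blackwell (sm_100+); no Ampere/Hopper fallback")
    if cfg.head_dim not in (64, 128):
        reasons.append("head_dim must be 64 or 128")
    if cfg.block_size != 64:
        reasons.append("block size is fixed at 64")
    if cfg.pattern != "block_diagonal":
        reasons.append("only block-diagonal sparsity is supported")
    return reasons


if __name__ == "__main__":
    ads = AttentionConfig((10, 0), 128, 64, "block_diagonal")
    llm = AttentionConfig((9, 0), 128, 64, "causal")
    print(unsupported_reasons(ads))
    print(unsupported_reasons(llm))
```

Returning the list of reasons rather than a bare boolean makes it easier to log why a model fell back to another attention path.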
