The Qwen team released FlashQLA on Tuesday under the MIT license: a high-performance kernel library targeting the Gated Delta Network (GDN) linear attention mechanism that powers the Qwen3.5 and Qwen3.6 model families. The headline benchmark: 2-3x forward-pass and 2x backward-pass speedups over the established Flash Linear Attention (FLA) Triton implementation, measured on an Nvidia H200 across head dimensions matching Qwen's tensor-parallel configurations (TP1 through TP8, hv from 64 down to 8). Repository at github.com/QwenLM/FlashQLA. The substance is in what FlashQLA chose to build on: not Triton, but TileLang, a relatively new compiler framework that exposes Hopper-specific scheduling primitives Triton cannot fully express.

The architectural context matters. Linear attention replaces standard softmax attention's O(n²) complexity with O(n), which becomes load-bearing as sequence lengths cross 100k tokens. GDN is a "gated" variant of the delta rule: a fixed-size recurrent state decays under an exponential gate over past context while a delta-rule update writes each new key-value association (sketched below). That formulation admits an efficient kernel-level implementation, but it requires careful scheduling of memory movement, Tensor Core operations, and CUDA core compute to actually deliver the theoretical efficiency. Qwen3.5/3.6 use a hybrid design: GDN layers alternate with standard full attention, getting the expressiveness of full attention where needed and the efficiency of linear attention everywhere else. FlashQLA specifically optimizes the linear-attention half of that stack, meaning the gain carries over to hybrid architectures, not just pure linear-attention models.
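To make the recurrence concrete, here is a minimal, deliberately unoptimized PyTorch sketch of one gated delta-rule step for a single head, following the published Gated DeltaNet formulation. The names and shapes (S, q, k, v, alpha, beta) are illustrative assumptions, not FlashQLA's API; production kernels process the sequence in chunks with Tensor Core matmuls rather than stepping token by token.

```python
import torch

def gdn_recurrent_step(S, q, k, v, alpha, beta):
    """One token of the gated delta rule for a single head (illustrative).

    S:     (d_k, d_v) recurrent state, a fixed-size "fast weight" matrix
    q, k:  (d_k,) query and key for this token
    v:     (d_v,) value for this token
    alpha: scalar in (0, 1), data-dependent decay gate over past context
    beta:  scalar in (0, 1), write strength for the delta-rule update
    """
    S = alpha * S                             # exponentially decay the past
    pred = k @ S                              # what the state predicts for key k
    S = S + beta * torch.outer(k, v - pred)   # delta-rule correction toward v
    o = q @ S                                 # readout: O(d_k * d_v) per token
    return o, S

# Toy usage: per-token cost is constant, so a length-n sequence is O(n).
d_k, d_v, n = 128, 128, 16
S = torch.zeros(d_k, d_v)
for t in range(n):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    o, S = gdn_recurrent_step(S, q, k, v, alpha=0.95, beta=0.5)
```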

The Triton-vs-TileLang dimension is the broader signal. Triton (OpenAI's Python-based GPU programming language) democratized kernel writing; many production ML kernels, including widely used FlashAttention variants, rely on it. But Triton's abstraction targets a generic CUDA programming model, which doesn't fully expose Hopper-specific features: warpgroup-level Tensor Core operations, asynchronous data pipelines, and warp specialization that splits a kernel across 128-thread warpgroups assigned to specialized roles (one moves data, one runs Tensor Cores, one runs CUDA cores, all overlapping). FlashQLA uses TileLang's warp-specialized kernel primitives to orchestrate this overlap by hand. The result is a kernel that's more brittle (Hopper-specific, requiring SM90+ with CUDA 12.8+ and PyTorch 2.8+) but materially faster than what Triton can produce. We're back to a regime where serious kernel performance demands hand-tuned, hardware-specific implementations; Triton was a beautiful abstraction, but it costs throughput on the latest silicon.
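For flavor, here is roughly what TileLang's programming model looks like, lightly adapted from the pipelined GEMM example in TileLang's own documentation. This is not FlashQLA's kernel, and the exact API surface varies across TileLang versions, so treat it as a sketch. Even in this simple form the scheduling is explicit: shared-memory staging, a multi-stage software pipeline, and tile-level gemm calls that lower to warpgroup Tensor Core instructions on SM90.

```python
import tilelang
import tilelang.language as T

# Pipelined GEMM tile kernel, adapted from TileLang's documented examples.
# T.Pipelined overlaps async copies into shared memory with the Tensor Core
# gemm on the previous stage: the same producer/consumer overlap that warp
# specialization makes explicit on Hopper.
def matmul(M, N, K, block_M=128, block_N=128, block_K=64,
           dtype="float16", accum_dtype="float"):

    @T.prim_func
    def kernel(
        A: T.Tensor((M, K), dtype),
        B: T.Tensor((K, N), dtype),
        C: T.Tensor((M, N), dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), accum_dtype)
            T.clear(C_local)
            # num_stages=3: triple-buffered pipeline over the K dimension.
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)  # WGMMA on SM90
            T.copy(C_local, C[by * block_M, bx * block_N])

    return kernel

# Compile for the current GPU target (entry point per TileLang's examples).
jit_kernel = tilelang.compile(matmul(4096, 4096, 4096))
```

FlashQLA goes further than this, using TileLang's warp-specialization primitives to pin the producer and consumer roles to distinct warpgroups rather than leaving the overlap to the pipeliner.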

For builders, three takeaways. First, if you're running Qwen3.5/3.6 inference at scale on H100/H200, swapping FLA for FlashQLA is potentially free 2x decode throughput, but verify on your specific deployment (a minimal timing harness is sketched below) because the published numbers are single-kernel latency, not end-to-end serving. Second, the Triton-vs-TileLang split signals a portability tax that will keep widening: portable kernels run everywhere but slower, while hardware-specific kernels require maintaining separate code paths per GPU generation (SM89 Ada, SM90 Hopper, SM100 Blackwell). Frameworks like TileLang and CUTLASS will increasingly own the high-performance ceiling while Triton keeps the developer-friendly floor. Third, this is a tell about Qwen's infra team: shipping a hand-tuned kernel library alongside model weights is the kind of vertically integrated optimization Western open-source teams have been slower to invest in. DeepSeek-V3 came with custom CUDA implementations; Qwen3.x now comes with a custom kernel library. The bar for "open weights" is quietly becoming "open weights plus the kernels you need to actually serve them efficiently." That's a meaningful upgrade to what open-source AI delivery looks like.
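Here is a minimal sketch of that verification step. It assumes nothing about FlashQLA's API; `run_fla_step` and `run_flashqla_step` are hypothetical placeholders you would wire to the two implementations at your model's actual head configuration. CUDA launches are asynchronous, so the harness times with CUDA events rather than wall-clock calls.

```python
import torch

def bench_kernel(fn, *args, warmup=20, iters=100):
    """Median CUDA-event latency of fn(*args) in milliseconds."""
    for _ in range(warmup):          # warm up: autotuning, caches, clocks
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    times = []
    for _ in range(iters):
        start.record()
        fn(*args)
        end.record()
        torch.cuda.synchronize()     # wait so elapsed_time is valid
        times.append(start.elapsed_time(end))
    times.sort()
    return times[len(times) // 2]

# Hypothetical wiring: replace with the actual FLA and FlashQLA entry
# points and your deployment's tensor shapes, then compare.
# fla_ms  = bench_kernel(run_fla_step, q, k, v, g, beta)
# fqla_ms = bench_kernel(run_flashqla_step, q, k, v, g, beta)
# print(f"FLA {fla_ms:.3f} ms vs FlashQLA {fqla_ms:.3f} ms")
```

Single-kernel wins shrink once attention layers, MLPs, and scheduling overhead enter the picture, so treat any kernel-level speedup as an upper bound until you've measured tokens per second on the full serving path.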