Nous Research published Lighthouse Attention this week — a training-only hierarchical attention mechanism that pools queries, keys, and values symmetrically across a multi-level pyramid, runs top-K selection outside the kernel, and lets FlashAttention operate on a small dense sub-sequence. Reported wall-clock pretraining speedup: 1.40-1.69× end-to-end vs cuDNN-backed SDPA on a 530M Llama-3-style decoder, tested at 512K context on a single GPU and 1M tokens across 32 GPUs with context parallelism. Kernel-level speedup at 512K is sharper: 21× forward, 17.3× forward+backward. Authors: Peng, Ghosh, Quesnelle. arXiv 2605.06554, code at github.com/ighoshsubho/lighthouse-attention as a patch on torchtitan plus two new files.

The architectural choice that separates Lighthouse from prior NSA and HISA work is symmetric Q/K/V pooling, not just K/V. Earlier selection-based attention methods left queries at full resolution and pooled only the K/V side; Lighthouse pools all three into the pyramid and runs ℓ₂-norm chunked-bitonic top-K selection across them. The cost moves from O(N·S·d) to O(S²·d). The four-stage pipeline — average-pool into L levels, score and top-K, gather selected entries, run stock FlashAttention on the gather, scatter outputs back via a deterministic kernel — keeps the inner attention kernel exactly as it is on dense sequences. That is the practical reason FlashAttention's speedup compounds with Lighthouse's selection rather than fighting it.

The training-only positioning matters. Lighthouse is removed at inference: a two-stage training recipe trains with selection enabled in stage 1 and resumes under dense SDPA in stage 2. Final training loss 0.6980-0.7102 vs dense-from-scratch baseline 0.7237 — marginally better — at 22.5-27.0 hours wall-clock vs 37.9 hours for the dense baseline on the same model and token budget (~50.3B tokens, 16,000 steps). So the win is on the training-compute axis, not the inference-compute axis: a model trained with Lighthouse behaves like a normal dense model at deployment. This is a different problem statement from sparse-attention-at-inference work (StreamingLLM, KV cache compression) and from architecture-level sparse-attention shipped to production. Lighthouse is the "pretrain cheaper, deploy dense" point in the design space.

Monday: if you're pretraining a model at long context on commodity training infrastructure, Lighthouse is a torchtitan patch and two files away from being on your training run for ablation. The 530M-scale result is suggestive, not load-bearing — whether 1.4-1.7× holds at 7B, 70B, or 405B is the open question. The selection overhead (gather/scatter, top-K) doesn't scale linearly with model size, so the speedup could compress or expand. Watch for Nous itself replicating at scale, watch for whether the next Llama, Qwen, or DeepSeek pretrains adopt the symmetric-pyramid pooling trick, and watch the GitHub repo for a cuDNN-grade fused kernel that hasn't been published yet — that's where production-grade adoption gates.