Moonshot AI shipped Kimi-K2.6 this week, the latest in a cadence that has made the Beijing lab one of the most consistent open-weights shippers in the space. The release lands the same week as their PrfaaS serving-infrastructure paper, which suggests the training and serving sides of their stack are being pushed in coordination. Weights are up at huggingface.co/moonshotai/Kimi-K2.6. As usual with Moonshot, the technical claims are concrete enough to evaluate, even if the full model card is thinner than the architecture disclosure.

The architecture is a sparse mixture of experts: one trillion total parameters, 384 experts per MoE layer, eight experts active per forward pass. That puts the active parameter count in the same rough band as DeepSeek-V3's sparse routing, and the design choices rhyme across the rest of the stack. Attention is Multi-Head Latent Attention (MLA), which compresses the cached KV state into a lightweight latent representation and has been one of the more effective ways to cut serving memory on long-context workloads; the feed-forward activations are SwiGLU. MLA plus sparse MoE is the DeepSeek-style template at this point; Moonshot running it at 1T total is a scale push on the same design language rather than a new recipe.
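To make the sparsity concrete, here is a minimal sketch of the top-k expert routing the release describes: 384 experts per layer, 8 active per token. The expert counts come from the announcement; everything else (hidden size, the random router weights) is an illustrative placeholder, not a disclosed Kimi-K2.6 value.

```python
import numpy as np

N_EXPERTS = 384   # experts per MoE layer (from the release)
TOP_K = 8         # experts active per forward pass (from the release)
D_MODEL = 1024    # hypothetical hidden size, for illustration only

rng = np.random.default_rng(0)

def route(token: np.ndarray, router_w: np.ndarray, k: int = TOP_K):
    """Score all experts, keep the top-k, renormalize their weights.

    Only the k selected experts' FFNs run for this token, which is
    why active parameters stay a small fraction of the 1T total.
    """
    logits = token @ router_w                        # (N_EXPERTS,) scores
    top = np.argsort(logits)[-k:]                    # indices of the k best experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                         # softmax over the selected k
    return top, weights

router_w = rng.standard_normal((D_MODEL, N_EXPERTS))
token = rng.standard_normal(D_MODEL)
experts, weights = route(token, router_w)

print(f"{len(experts)} of {N_EXPERTS} experts active, weights sum to {weights.sum():.3f}")
```

The token's output is then the weighted sum of the 8 selected experts' FFN outputs; the other 376 experts contribute no compute for that token, which is the entire serving-cost argument for the design.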

Benchmarks are the part to caveat. Moonshot claims the model matches or beats the frontier across more than two dozen benchmarks, but the one specific head-to-head number disclosed is HLE-Full: Kimi-K2.6 scores 54, Claude Opus 4.6 scores 53, GPT-5.4 scores 52.1. That's a win, but it's a one-point win on a single benchmark, and the rest of the claimed comparisons are qualitative in the source material. Context length, training token count, and training cost are not disclosed in the release we have. So: competitive on what we can see, insufficient data to confirm the full "matches or beats frontier" claim across the broader benchmark set. Independent evals on HumanEval, SWE-bench, GPQA, MATH, and AIME will sharpen the picture over the next two weeks.

If you are shipping long-context inference on a budget, the practical read is straightforward. The open-weights sparse-MoE-plus-MLA pattern from DeepSeek has now been validated at 1T total by a second Chinese lab, and the weights are downloadable today. That gives you a real option to compare against whatever closed frontier model you are currently paying for, with a serving profile designed from the ground up to keep active-parameter count and KV cache manageable. The longer-term pattern is the one to track: Moonshot, DeepSeek, Qwen, and GLM are shipping competitive open-weights models on a faster cadence than closed labs are shipping preview models, and the serving-infrastructure papers (PrfaaS this week, various ring-attention and hybrid-attention papers earlier) show the same labs are also closing the inference-cost gap at the same time.
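For a sense of why the MLA half of the recipe matters on long-context serving, here is rough cache arithmetic. Every number below is an assumption for illustration, loosely modeled on published DeepSeek-style MLA configs, not a disclosed Kimi-K2.6 figure; the sketch also ignores the small decoupled RoPE key MLA caches in practice.

```python
N_LAYERS = 61          # hypothetical layer count
N_HEADS = 128          # hypothetical attention heads
HEAD_DIM = 128         # hypothetical per-head dimension
LATENT_DIM = 512       # hypothetical MLA compressed-KV latent width
BYTES = 2              # fp16/bf16 cache entries
CONTEXT = 128_000      # tokens in context

# Standard multi-head attention caches a full K and V vector
# per head, per layer, per token.
mha_bytes = CONTEXT * N_LAYERS * N_HEADS * HEAD_DIM * 2 * BYTES

# MLA caches one shared compressed latent per token per layer,
# from which K and V are reconstructed at attention time.
mla_bytes = CONTEXT * N_LAYERS * LATENT_DIM * BYTES

gib = 1024 ** 3
print(f"MHA cache: {mha_bytes / gib:.1f} GiB")
print(f"MLA cache: {mla_bytes / gib:.1f} GiB")
print(f"reduction: {mha_bytes / mla_bytes:.0f}x")
```

Under these assumed dimensions the full-KV cache runs to hundreds of GiB at 128K context while the latent cache stays in single digits, a roughly 64x reduction. The exact ratio depends entirely on the real head and latent dimensions, but the shape of the argument is why the MLA-plus-sparse-MoE serving profile is the thing to benchmark against your current closed-model bill.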