Liquid AI LFM2.5-8B-A1B: on-device MoE, 1.5B active, 253 tok/s on M5 Max CPU

Liquid AI released LFM2.5-8B-A1B, an open-weight Mixture-of-Experts model that activates only 1.5B of its 8.3B total parameters per token. The number that matters for builders is the on-device throughput: 253 tokens/sec on an M5 Max laptop CPU in under 6 GB of memory, ~30 tokens/sec on mobile, and 18.5K tokens/sec on an H100 (roughly 1.6B tokens/day at high concurrency). This is the deployment-economics move: you pay 1.5B-active inference cost while drawing on an 8.3B-parameter knowledge pool, on hardware that fits in a backpack. Weights are on Hugging Face under the LFM1.0 license, with base and post-trained checkpoints, and run today on llama.cpp, MLX, vLLM, and SGLang.
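The tokens-per-day figure is just the per-second rate multiplied out over 24 hours. A minimal sketch of that arithmetic, using only the throughput numbers quoted above and assuming the rate is sustained all day (the function name and dictionary labels are illustrative, not from Liquid AI):

```python
# Convert a sustained tokens/sec rate into tokens/day.
# Assumption: the quoted rate holds continuously for 24 hours.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def tokens_per_day(tokens_per_second):
    """Return the number of tokens generated in one day at a constant rate."""
    return tokens_per_second * SECONDS_PER_DAY


# Throughput figures as quoted in the article.
reported = {"H100": 18_500, "M5 Max CPU": 253, "mobile": 30}

for name, tps in reported.items():
    print(f"{name}: {tokens_per_day(tps) / 1e6:,.1f}M tokens/day")
```

At 18.5K tokens/sec this comes to 1,598,400,000 tokens/day, which is where the roughly-1.6B figure comes from.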

The architecture is a hybrid, not a vanilla MoE transformer. Of 24 layers, 18 are double-gated LIV convolution blocks and 6 are grouped-query attention layers, with MoE routing layered on top. The conv-heavy design is what keeps the active-param cost and memory footprint low enough for edge. The context window quadrupled to 131,072 tokens from the predecessor's 32K, and the vocabulary grew to 128K tokens with compression gains tuned for Hindi, Thai, Vietnamese, Indonesian, and Arabic. Benchmark jumps over LFM2-8B-A1B are large: IFEval 79.44 → 91.84 (matching Gemma-4-26B despite far fewer active params), MATH500 74.80 → 88.76, AA-Omniscience non-hallucination rate 7.46 → 63.47, Tau² Telecom 13.60 → 88.07. Liquid states the limitations plainly: the small active-param count caps knowledge capacity, so the model is not suited for heavy programming or knowledge-intensive work without retrieval augmentation, and it is text-only, with no vision or audio.
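One way to see why the 18-conv / 6-attention split matters for memory: only attention layers keep a key/value cache that grows with context length. The sketch below compares the KV cache at the full 131,072-token context for 6 attention layers against a hypothetical all-attention 24-layer stack. The KV-head count, head dimension, and 16-bit cache precision are assumptions chosen for illustration, not published LFM2.5 values, and the estimate ignores whatever state the convolution blocks themselves keep.

```python
# Rough KV-cache size estimate for a grouped-query attention stack.
# ASSUMPTIONS (not from the model card): 8 KV heads, head dimension 64,
# 2 bytes per cached value (fp16/bf16). Only the layer counts and the
# context length come from the article.
def kv_cache_bytes(attn_layers, context_len, kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for every attention layer."""
    keys_and_values = 2  # one tensor for K, one for V
    return keys_and_values * attn_layers * context_len * kv_heads * head_dim * bytes_per_value


CONTEXT = 131_072  # context window quoted in the article
KV_HEADS = 8       # assumption, for illustration only
HEAD_DIM = 64      # assumption, for illustration only

hybrid = kv_cache_bytes(6, CONTEXT, KV_HEADS, HEAD_DIM)          # 6 attention layers
all_attention = kv_cache_bytes(24, CONTEXT, KV_HEADS, HEAD_DIM)  # hypothetical 24

print(f"hybrid (6 attention layers):  {hybrid / 2**30:.2f} GiB")
print(f"all-attention (24 layers):    {all_attention / 2**30:.2f} GiB")
print(f"reduction: {all_attention / hybrid:.0f}x")
```

Whatever the real head configuration is, the ratio depends only on the layer counts: caching 6 layers instead of 24 cuts the context-dependent memory by 4x.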

The ecosystem read: MoE-on-edge is now a real category, distinct from dense small models. Qwen, Gemma, and Phi compete in dense sub-10B; LFM2.5-8B-A1B's bet is that sparse activation gives you a higher quality ceiling at the same inference cost. That is the right tradeoff specifically for on-device, where memory bandwidth, not compute, is the binding constraint. The 1.5B-active number is what lets it run on a phone at usable speed; a dense 8.3B model would not. For the agent stack, an on-device model with tool calling and 128K context changes the architecture of what can run without a cloud round-trip: local agents that read long documents, call tools, and reason, with the cloud reserved for the knowledge-heavy calls the model itself flags as out of its depth. That is what the non-hallucination jump to 63.47 is really measuring: the model knowing when it does not know.
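The bandwidth argument can be made concrete with a back-of-envelope bound: if decoding is limited by how fast weights stream from memory, tokens/sec is at most bandwidth divided by the bytes of weights read per token. The sketch below applies that bound to 1.5B active parameters versus a dense 8.3B model. The 50 GB/s bandwidth and 4-bit weights are illustrative assumptions, not measurements of any particular phone, and the bound ignores KV-cache reads, routing overhead, and compute.

```python
# Upper bound on decode speed when memory bandwidth is the bottleneck:
#   tokens/sec <= bandwidth / (params read per token * bytes per param)
# ASSUMPTIONS: 50 GB/s memory bandwidth and 4-bit (0.5 byte) weights are
# placeholder values for illustration; parameter counts are from the article.
def decode_tokens_per_second(params_read_per_token, bytes_per_param, bandwidth_bytes_per_s):
    """Bandwidth-bound ceiling on tokens generated per second."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


BANDWIDTH = 50e9       # bytes/sec, assumed
BYTES_PER_PARAM = 0.5  # 4-bit quantization, assumed

sparse = decode_tokens_per_second(1.5e9, BYTES_PER_PARAM, BANDWIDTH)  # 1.5B active
dense = decode_tokens_per_second(8.3e9, BYTES_PER_PARAM, BANDWIDTH)   # dense 8.3B

print(f"1.5B active: <= {sparse:.1f} tok/s")
print(f"8.3B dense:  <= {dense:.1f} tok/s")
print(f"speedup ceiling: {sparse / dense:.2f}x")
```

The absolute numbers move with the assumed bandwidth and quantization, but the ratio does not: reading 1.5B parameters per token instead of 8.3B raises the ceiling by about 5.5x.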

If you ship edge or on-device AI Monday morning: the 253 tok/s on laptop CPU and ~30 tok/s on mobile are the numbers to benchmark against your own target hardware, and the LFM1.0 license is the thing to read before assuming commercial use. If you build agent infra: pair this with a RAG layer for the knowledge tasks it flags as out of depth, and you have a local-first agent that only hits the cloud when it has to. The structural news is that sparse on-device beat dense on-device on the quality-per-active-param frontier. Watch whether Qwen and Gemma follow with MoE edge variants.
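The local-first pattern described above can be sketched as a three-step fallback: try the on-device model, retry with retrieved context if it flags the query as out of depth, and only then go to the cloud. Everything named here is hypothetical: `LocalAnswer`, the `out_of_depth` flag, and the stub functions stand in for whatever model runtime, retriever, and cloud API you actually use, and how a real model signals uncertainty is up to your integration.

```python
from dataclasses import dataclass


@dataclass
class LocalAnswer:
    text: str
    out_of_depth: bool  # hypothetical flag: the local model says it does not know


def answer(query, local_model, retrieve, cloud_model):
    """Return (route, text), escalating local -> local+rag -> cloud."""
    first = local_model(query, [])
    if not first.out_of_depth:
        return "local", first.text

    docs = retrieve(query)
    if docs:
        second = local_model(query, docs)
        if not second.out_of_depth:
            return "local+rag", second.text

    return "cloud", cloud_model(query)


# Stubs for demonstration only; replace with real model and retriever calls.
def stub_local(query, docs):
    known = "capital" in query or bool(docs)
    return LocalAnswer(text=f"local answer to: {query}", out_of_depth=not known)


def stub_retrieve(query):
    return ["policy document excerpt"] if "policy" in query else []


def stub_cloud(query):
    return f"cloud answer to: {query}"


for q in ["capital of France?", "what is the refund policy?", "prove this theorem"]:
    route, _ = answer(q, stub_local, stub_retrieve, stub_cloud)
    print(f"{q!r} -> {route}")
```

Keeping the three backends as plain callables makes the routing logic testable without loading a model, and the cloud call stays on the last branch so it only runs when both local attempts decline.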
