Liquid AI's new 350M retrieval models beat a rival nearly twice their size, and run on a laptop, Zubnet AI News

Liquid AI has released two new retrieval models, LFM2.5-Embedding-350M and LFM2.5-ColBERT-350M, and the headline is the size: at 350 million parameters each, both beat the larger Qwen3-Embedding-0.6B on multilingual search. They are the first bidirectional members of the LFM family, built from the LFM2.5-350M-Base checkpoint Liquid shipped in March, and they are tuned for fast multilingual and cross-lingual retrieval across 11 languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish.

The two models take different routes to the same job. LFM2.5-Embedding-350M is a dense bi-encoder: it compresses an entire document into a single 1024-dimensional vector, which keeps the search index small and lookups cheap. LFM2.5-ColBERT-350M uses late interaction instead, keeping a separate 128-dimensional vector for every token and matching each query token against the document's token vectors at search time. That token-level matching tends to be more accurate and to generalize better to topics the model was not trained on, at the cost of a larger index. Having both in one family lets a team pick the trade-off that fits, or use the cheap bi-encoder to retrieve and the ColBERT model to rerank.
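To make the trade-off concrete, here is a minimal sketch in plain Python of the two scoring styles and the retrieve-then-rerank pattern. It is not Liquid's code: the function names are hypothetical, the toy vectors are two-dimensional rather than 1024 or 128, and the scoring follows the common recipe of cosine similarity for dense vectors and a sum of per-token maxima (MaxSim) for late interaction. The storage estimate assumes uncompressed vectors and a document that fills the 512-token document length.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm else list(v)

def dense_score(query_vec, doc_vec):
    """Bi-encoder style: one vector per text, compared by cosine similarity."""
    return dot(normalize(query_vec), normalize(doc_vec))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction style: each query token keeps its best-matching
    document token, and the per-token maxima are summed."""
    q = [normalize(t) for t in query_tokens]
    d = [normalize(t) for t in doc_tokens]
    return sum(max(dot(qt, dt) for dt in d) for qt in q)

def retrieve_then_rerank(query_vec, query_tokens, corpus, k=2):
    """corpus is a list of (doc_id, doc_vec, doc_token_vecs) tuples.
    Stage 1 shortlists k documents with the cheap dense score;
    stage 2 reorders only that shortlist with MaxSim."""
    shortlist = sorted(
        corpus, key=lambda doc: dense_score(query_vec, doc[1]), reverse=True
    )[:k]
    reranked = sorted(
        shortlist, key=lambda doc: maxsim_score(query_tokens, doc[2]), reverse=True
    )
    return [doc[0] for doc in reranked]

def index_floats(num_docs, tokens_per_doc, dense_dim=1024, token_dim=128):
    """Number of stored values per index, ignoring any compression (an assumption)."""
    return num_docs * dense_dim, num_docs * tokens_per_doc * token_dim

# Toy corpus with made-up 2-d vectors, for illustration only.
corpus = [
    ("a", [1.0, 0.0], [[1.0, 0.0]]),
    ("b", [0.9, 0.1], [[1.0, 0.0], [0.0, 1.0]]),
    ("c", [0.0, 1.0], [[0.0, 1.0]]),
]
ranking = retrieve_then_rerank([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], corpus, k=2)
dense_size, late_size = index_floats(num_docs=1, tokens_per_doc=512)
```

In the toy corpus, the dense stage shortlists documents a and b, and the MaxSim stage then moves b ahead of a because b matches both query tokens. Under the stated assumptions, a full 512-token document costs 1,024 stored values in the dense index and 65,536 in the late-interaction index, a 64-fold difference, which is the "larger index" cost in numbers.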

The numbers back the size claim. On NanoBEIR Multilingual, a retrieval benchmark scored by NDCG@10, the ColBERT model averages 0.605 and the embedding model 0.577 across the 11 languages, both ahead of Qwen3-Embedding-0.6B at 0.556, the previous LFM2-ColBERT-350M at 0.540, and Alibaba's gte-multilingual-base at 0.528. On MKQA-11, a cross-lingual question-answering test scored by Recall@20, the ColBERT and embedding models land at 0.694 and 0.691 respectively, again above Qwen3 at 0.638. The wins are not blowouts, with margins over Qwen3 ranging from about 0.02 to 0.06, but a 350M model topping a 0.6B one on multilingual retrieval is the kind of result that matters when you are paying for every vector you store and serve.
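For readers unfamiliar with the two metrics, the sketch below shows their textbook definitions in plain Python. The function names are hypothetical, and the exact variants used by the NanoBEIR and MKQA evaluations may differ (for example in how graded relevance is converted to gain, or in how a question-answering hit is counted), so treat this as an illustration of what the scores measure rather than a reproduction of the benchmark code.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: each relevance value is discounted
    by log2 of its 1-based rank plus one."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_ids, relevance, k=10):
    """NDCG@k: DCG of the system's ranking divided by the DCG of the
    ideal ranking. `relevance` maps doc_id -> relevance label."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids]
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg else 0.0

def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Recall@k: fraction of the relevant documents found in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

# One relevant document, ranked first versus second.
perfect = ndcg_at_k(["a", "b"], {"a": 1})
second = ndcg_at_k(["b", "a"], {"a": 1})
half = recall_at_k(["a", "b", "c"], {"a", "z"})
```

With a single relevant document, ranking it first scores 1.0 and ranking it second scores about 0.63, which shows how strongly NDCG rewards getting the best result to the top. Recall@20, by contrast, ignores order within the top 20 and only asks how much of the relevant material was retrieved at all.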

Speed is the other half of the pitch. Liquid reports a query embedding in about 7.3 milliseconds at the median on a MacBook M4 Max CPU, and roughly 1.5 milliseconds on an H100 GPU. Both models support a 32,768-token context, tuned to 512 tokens for documents, and ship in standard and GGUF formats so they run under llama.cpp. As the company puts it, they are small enough to run almost anywhere. Both are available now on Hugging Face under the LFM Open License v1.0.

For anyone building retrieval, that combination is the interesting part. Search quality has usually climbed with model size and the bill that comes with it, so squeezing better multilingual retrieval into a model that fits on a phone or a single CPU points the other way: private, on-device, and cheap-to-serve search that does not call out to a hosted API. The caveats are worth stating plainly. These are retrieval and embedding models, not chat models; the benchmarks are NanoBEIR and MKQA rather than the full MTEB suite; and beating a 0.6B model is a real but narrow win, not a leap past the largest commercial embedders. Still, the direction is clear, and it is the direction small-model retrieval has been heading all year.
