Liquid AI released LFM2.5-VL-450M, upgrading their 450M-parameter vision-language model with object detection capabilities that score 81.28 on RefCOCO-M (up from zero) and expanded multilingual support across eight languages. The model maintains its edge deployment focus, running inference in under 250ms on hardware ranging from NVIDIA Jetson Orin modules to Samsung Galaxy S25 Ultra phones. Training scaled from 10T to 28T tokens with added preference optimization to improve instruction following and grounding accuracy.
This matters because most vision-language models require cloud infrastructure, creating latency and privacy issues for real-world applications like warehouse robotics or smart retail cameras. When I covered Liquid AI's 350M model last month, their hybrid architecture was already outperforming larger rivals. Adding object detection to a 450M model that runs locally changes the deployment calculus for computer vision applications that need both speed and structured outputs.
The technical details show thoughtful engineering choices: SigLIP2 vision encoder with 512×512 native resolution, thumbnail encoding for global context during image tiling, and tunable image token limits for speed-quality tradeoffs without retraining. Function calling support suggests they're targeting agentic workflows where vision feeds into structured actions. However, the 512×512 resolution limit and 32K context window constrain use cases compared to cloud-based alternatives.
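To make the tiling and token-budget tradeoff concrete, here's a minimal sketch of the arithmetic involved. The 512×512 tile size comes from the release notes above, but the tokens-per-tile figure and the reserved thumbnail slot are illustrative assumptions, not published model parameters:

```python
import math

# Illustrative tiling math: a large image is split into 512x512
# native-resolution tiles, plus one downscaled thumbnail for global
# context. tokens_per_tile is an assumed figure for illustration only.
TILE = 512

def plan_tiles(width: int, height: int, max_image_tokens: int,
               tokens_per_tile: int = 256) -> dict:
    """Estimate the tile grid and whether a token budget is exceeded."""
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    tiles = cols * rows
    # One extra tile's worth of tokens reserved for the global thumbnail.
    tokens = (tiles + 1) * tokens_per_tile
    return {
        "grid": (rows, cols),
        "tiles": tiles,
        "tokens": tokens,
        "fits_budget": tokens <= max_image_tokens,
    }

# A 1080p frame needs a 3x4 grid (12 tiles); under a 2048-token cap it
# would have to be downscaled or have its token limit raised.
print(plan_tiles(1920, 1080, max_image_tokens=2048))
```

This is why a tunable image token limit matters on-device: lowering the cap forces fewer, coarser tiles for faster inference, while raising it preserves detail at the cost of latency.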
For developers building vision applications, this represents a practical middle ground between capability and deployment constraints. The sub-250ms inference opens up interactive use cases, while bounding box prediction enables structured data extraction from image streams. The real test will be how it performs on domain-specific tasks after fine-tuning, especially given Liquid AI's claims about adaptation efficiency.
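As a sketch of what "structured data extraction from image streams" could look like downstream, here is a parser for detection-style text output. The `label <box>x1,y1,x2,y2</box>` syntax is a hypothetical format chosen for illustration; the model's actual output grammar may differ:

```python
import re

# Hypothetical detection-output format: "label <box>x1,y1,x2,y2</box>".
# This syntax is an assumption for illustration, not the model's
# documented output grammar.
DET_RE = re.compile(r"(?P<label>[\w ]+?)\s*<box>(?P<coords>[\d.,\s]+)</box>")

def parse_detections(text: str) -> list[dict]:
    """Extract (label, bounding box) records from a detection response."""
    out = []
    for m in DET_RE.finditer(text):
        x1, y1, x2, y2 = (float(v) for v in m.group("coords").split(","))
        out.append({"label": m.group("label").strip(),
                    "box": (x1, y1, x2, y2)})
    return out

raw = "pallet <box>12,40,310,480</box> forklift <box>330,60,600,470</box>"
for det in parse_detections(raw):
    print(det)
```

Converting free-text grounding output into records like these is the step that lets a sub-250ms local model feed inventory counts or safety alerts directly into application logic without a cloud round trip.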
