The Allen Institute for AI (Ai2) released MolmoAct 2 today, a ground-up rebuild of its open-source vision-language-action (VLA) foundation model. The headline numbers: built on Molmo 2-ER (the embodied-reasoning Molmo variant trained on 3M image-grounded reasoning examples), supplemented with a new MolmoAct 2-Bimanual YAM dataset of 720+ hours of dual-armed robot trajectories, language annotations expanded from 71k unique labels to ~146k, and a 37x speedup on real-world tasks vs MolmoAct v1. Real-world validation comes from Stanford's Cong Lab, on CRISPR-related lab work. The models ship as an open foundation; a training code release is planned.

The architectural lineage matters here. MolmoAct's original move, and what differentiates it from text-token VLAs like RT-2 or OpenVLA, is grounding scene semantics through depth-aware perception tokens rather than language tokens. The model runs three autoregressive stages: spatially grounded perception tokens (extracted with a VQ-VAE, encoding geometric structure via depth and positional embeddings), then waypoints in image space that sketch the visual reasoning trace, then low-level action commands for the hardware. MolmoAct v1 hit 72.1% out-of-distribution success on its eval, beating closed VLAs from Physical Intelligence, Google, Microsoft, and NVIDIA. v2 keeps the depth-token approach but adds a dedicated "action expert" that does 3D reasoning natively, and the bimanual training data closes the gap to humanoid-class manipulation tasks, where two-arm coordination is the actual hard part. The 37x speedup claim needs context: Ai2 hasn't disclosed whether that's inference latency, planning throughput, or end-to-end task completion time, nor which baseline (the v1 evaluation harness or a comparable closed VLA) is the divisor.
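To make the staged structure concrete, here's a minimal sketch of how a three-stage rollout step would be wired, assuming the description above. Every class and function name (FakeVLA, perceive, plan, act) is an illustrative stand-in, not Ai2's actual API; in the real model all three stages come out of one autoregressive transformer as token sequences.

```python
# Sketch of MolmoAct-style three-stage decoding: perception tokens -> waypoints
# -> actions. Names are hypothetical; only the staging reflects the description.
from dataclasses import dataclass

import numpy as np


@dataclass
class PerceptionTokens:
    # Discrete VQ-VAE codes encoding depth-aware scene geometry.
    codes: np.ndarray          # shape: (num_tokens,), integer codebook indices


@dataclass
class Waypoints:
    # 2D image-space points sketching the visual reasoning trace.
    pixels: np.ndarray         # shape: (num_waypoints, 2)


@dataclass
class ActionChunk:
    # Low-level commands for the hardware (joint deltas + gripper state).
    joint_deltas: np.ndarray   # shape: (horizon, dof)
    gripper: np.ndarray        # shape: (horizon,)


class FakeVLA:
    """Stand-in policy; each stage conditions on everything decoded before it."""

    def perceive(self, rgbd: np.ndarray, instruction: str) -> PerceptionTokens:
        return PerceptionTokens(codes=np.random.randint(0, 1024, size=64))

    def plan(self, rgbd, instruction, percept: PerceptionTokens) -> Waypoints:
        return Waypoints(pixels=np.random.randint(0, 224, size=(8, 2)))

    def act(self, rgbd, instruction, percept, waypoints) -> ActionChunk:
        return ActionChunk(joint_deltas=np.zeros((16, 7)), gripper=np.ones(16))


def rollout_step(model: FakeVLA, rgbd: np.ndarray, instruction: str) -> ActionChunk:
    percept = model.perceive(rgbd, instruction)          # stage 1: ground the scene
    waypoints = model.plan(rgbd, instruction, percept)   # stage 2: visual plan
    return model.act(rgbd, instruction, percept, waypoints)  # stage 3: hardware commands


if __name__ == "__main__":
    chunk = rollout_step(FakeVLA(), np.zeros((224, 224, 4)), "pick up the pipette")
    print(chunk.joint_deltas.shape, chunk.gripper.shape)
```

The point of the staging is that the intermediate waypoints are inspectable: you can see the model's plan in image space before any motor command is issued.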

The ecosystem read: Ai2 is the open-source counterweight in an increasingly closed VLA race. Physical Intelligence's π0/π0.5, Figure's Helix, NVIDIA's GR00T N1, and Google's RT-2 all sit behind walls or selective licensing. MolmoAct 2 is the only fully open foundation in this generation that ships actual policies you can fine-tune for your robot stack, and the bimanual dataset alone carries more hours than most open robotics datasets. For builders training their own robot policies, this changes the math: previously the choice was between an open base lacking dexterous manipulation data (Octo, OpenVLA, RDT) or a closed checkpoint you couldn't extend. With MolmoAct 2 plus the YAM dataset, the open path now includes the data scale the closed labs were betting builders couldn't reach. The proprietary VLA labs are about to find out how their moats hold against an open foundation that's been rebuilt explicitly to compete with them.

Practical move: if you're training robot policies on dual-armed hardware, MolmoAct 2-Bimanual YAM is worth a download once it lands. Pretraining on Molmo 2-ER's 3M-example base means the perception side is solid before you even touch your task-specific data. If you're doing single-arm work, the perception-token architecture transfers, but you'll capture less of the bimanual gain. The eval boundary to watch: Ai2 hasn't published comparison numbers against π0.5, Helix, or GR00T N1; those comparisons will emerge from independent benchmarks over the next month, and that's where the actual frontier read settles. The 37x speedup is the headline; the real question is what happens to that number when you put MolmoAct 2 head-to-head with the closed VLAs on the same task suite. For now, builders get an open foundation that didn't exist three days ago.
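If you want to be ready when the weights land, here's a hedged sketch of the likely pickup path: the Molmo family ships as trust_remote_code checkpoints on the Hugging Face Hub, so a MolmoAct 2 fine-tune would plausibly look like the below. The repo ids ("allenai/MolmoAct-2-7B", "allenai/MolmoAct-2-Bimanual-YAM") and the LoRA target module names are assumptions, not confirmed by Ai2; swap in whatever actually ships.

```python
# Hypothetical fine-tuning setup for an open MolmoAct 2 checkpoint.
# Repo ids and module names below are guesses; only the loading pattern
# (AutoProcessor + AutoModelForCausalLM with trust_remote_code) mirrors
# how existing Molmo checkpoints load.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

MODEL_ID = "allenai/MolmoAct-2-7B"             # hypothetical repo id
DATA_ID = "allenai/MolmoAct-2-Bimanual-YAM"    # hypothetical repo id

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Light-touch adaptation: LoRA adapters on the attention projections keep the
# perception-token backbone mostly frozen while you specialize on your own
# dual-arm trajectories. Module names vary by architecture; inspect the model
# to find the right targets.
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"])
model = get_peft_model(model, lora)

# Stream the bimanual trajectories rather than downloading 720+ hours up front.
trajectories = load_dataset(DATA_ID, split="train", streaming=True)
# From here: tokenize (image, instruction, action-chunk) triples with the
# processor and run a standard causal-LM fine-tuning loop over the adapters.
```

Treat this as a template to fill in once Ai2 publishes the actual checkpoint names and the promised training code, not as the blessed recipe.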