The bottleneck in robot learning was never training; it was evaluation. You train a policy, then you book the robot lab for a week to find out whether it beats the last one. That physical-world eval loop is why robot foundation models iterate slower than LLMs: an LLM gets SWE-Bench, which you can run in minutes; a robot policy gets "hundreds of hours of continuous hardware operation." Genesis AI's Genesis World 1.0, shipped May 27, attacks exactly this. The claim: a simulator faithful enough to evaluate policies in under half an hour, with no human and no hardware in the loop, against more than 200 hours of real-robot operation for the same suite.

The headline number is a Pearson correlation of 0.8996 between sim and real rollouts (95% CI [0.744, 0.931]), but that's not the one a builder should fixate on. The number that matters is the Mean Maximum Rank Violation: 0.0166. An eval harness doesn't need perfect absolute fidelity; it needs to rank your candidates the way reality would. MMRV 0.017 is the claim that when sim says policy A beats policy B, reality almost always agrees, across 3 model variants, 14 tasks, 200 episodes each, a million bootstrap iterations. The protocol is zero-shot real-to-sim: policies trained only on real data, no simulated pretraining leaking into the eval. Under the hood: a unified multi-physics engine (rigid, FEM, MPM, SPH, PBD); Nyx, a path-traced renderer hitting noise-free 1080p in 4 ms, batched across thousands of parallel rollouts; and Quadrants, a Taichi fork compiling Python physics kernels to CUDA, ROCm, Metal and Vulkan with reverse-mode autodiff. The honest gaps: 14 tasks is narrow next to SWE-Bench's thousands, the robot embodiments aren't disclosed, the hardware behind "under 0.5 hours" is unstated, and, the one that matters most, contact-rich real-world correlation is never validated. They cite a 103x speedup on contact-heavy scenes but offer no evidence that those scenes track reality. Contact and deformables are exactly where sim-to-real has always broken.
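To make the two metrics concrete, here is a minimal sketch of how Pearson correlation and MMRV can be computed from paired sim and real success rates. It assumes the MMRV definition used in earlier sim-eval work (for each policy, take the largest real-world performance gap among the pairs that sim ranks the wrong way, then average over policies); the announcement does not spell out its exact formula, so treat that as an assumption. The function names and the success rates are hypothetical, not Genesis numbers.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def mmrv(sim, real):
    """Mean Maximum Rank Violation (assumed definition, see lead-in).

    A pair (i, j) is a violation when sim and real disagree on which
    policy is better; its cost is the real-world gap |real[i] - real[j]|.
    """
    n = len(sim)
    worst = []
    for i in range(n):
        violations = [
            abs(real[i] - real[j])
            for j in range(n)
            if j != i and (sim[i] < sim[j]) != (real[i] < real[j])
        ]
        worst.append(max(violations, default=0.0))
    return sum(worst) / n

# Hypothetical success rates for three policy variants.
real = [0.58, 0.74, 0.77]
sim_agrees = [0.62, 0.71, 0.80]   # same ordering as real
sim_flipped = [0.62, 0.80, 0.71]  # swaps the two strongest policies

print(pearson(sim_agrees, real), mmrv(sim_agrees, real))
print(pearson(sim_flipped, real), mmrv(sim_flipped, real))
```

The flipped case shows why MMRV is the more useful number for a ranking harness: the violation is weighted by how much the two policies actually differ on hardware, so swapping two near-equal policies costs little, while swapping a strong policy with a weak one costs a lot.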

What this does to the ecosystem is the open-weights story applied to robotics measurement. The physics engine and Quadrants are Apache 2.0; Nyx ships as installable wheels. Just as open LLM eval harnesses let labs compete on models instead of on who controls the benchmark, an open sim-eval platform with credible ranking fidelity commoditizes the measurement layer for the embodiment race. Physical Intelligence, Skild, Figure, and every other robot-foundation-model shop live or die on iteration speed, and iteration speed is gated by eval. The under-discussed piece is Nyx: most physics sims have weak rendering, and vision-based policies die on the perception gap, not just the dynamics gap. Pairing a real path-tracer with the physics, along with the claimed 45% reduction in reality gap by FID, is a bet that closing the camera gap matters as much as closing the contact gap. Quadrants is independently useful too: multi-backend differentiable physics means you're not NVIDIA-locked for the compute, even if Nyx's renderer still is.

Monday morning, if you train robot policies: pip-install the Apache-2.0 engine and wire sim eval in as a ranking pre-filter that shrinks your real-hardware eval set. But re-measure MMRV on your own task distribution before you trust it, because 14 tasks won't cover your manipulation and contact cases, and that's where the correlation is least proven. Treat it as a fast first pass, not a replacement for the robot. If you're not in robotics at all, Quadrants is the takeaway: a multi-backend Python-to-GPU compiler with autodiff across CUDA, ROCm, Metal and Vulkan, useful for any differentiable-simulation work, fully decoupled from the robot framing.
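The pre-filter idea can be sketched in a few lines. This is an illustration under stated assumptions, not part of the Genesis tooling: `shortlist`, the checkpoint names, and the scores are all hypothetical, and the sim scores are assumed to come from whatever sim-eval run you wire up. The `margin` knob keeps near-ties in the real-hardware set, since sim is least trustworthy exactly when candidates are close.

```python
def shortlist(sim_scores, k, margin=0.0):
    """Pick which policies go on to real-hardware eval.

    Keeps the top-k policies by sim score, plus any policy whose sim
    score is within `margin` of the k-th best. Returns names ordered
    from best to worst sim score.
    """
    ranked = sorted(sim_scores.items(), key=lambda kv: kv[1], reverse=True)
    if k <= 0 or not ranked:
        return []
    kth_best = ranked[min(k, len(ranked)) - 1][1]
    cutoff = kth_best - margin
    return [name for name, score in ranked if score >= cutoff]

# Hypothetical sim success rates for five checkpoints.
sim_scores = {
    "ckpt_a": 0.81,
    "ckpt_b": 0.79,
    "ckpt_c": 0.64,
    "ckpt_d": 0.78,
    "ckpt_e": 0.52,
}

strict = shortlist(sim_scores, k=2)                 # top two only
cautious = shortlist(sim_scores, k=2, margin=0.02)  # also keeps near-ties
print(strict)
print(cautious)
```

One reasonable way to choose `margin` is from the rank violations you measure on your own tasks: if sim has been observed to misorder policies that differ by a couple of points on hardware, keep anything within that band rather than cutting it on sim evidence alone.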