NVIDIA Labs released SpatialClaw this week, a training-free framework that rethinks how an agent acts when it reasons about space. Instead of choosing from a fixed menu of tools, the agent writes code. At each step a vision-language model emits one executable Python cell into a stateful Jupyter kernel that comes pre-loaded with perception primitives (SAM3 for segmentation and Depth-Anything-3 for 3D reconstruction), geometry utilities, and scientific libraries like NumPy and SciPy. The agent runs the cell, looks at what comes back, writes the next one, and commits a final answer with a ReturnAnswer call. The repository's own framing is blunt about the thesis: rethinking the action interface for agentic spatial reasoning.
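The write, run, observe loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not SpatialClaw's actual implementation: a plain dictionary stands in for the stateful Jupyter kernel, a scripted list of cells stands in for the vision-language model, and `run_agent`, `AnswerCommitted`, and the signature of `ReturnAnswer` are hypothetical names chosen for the example.

```python
import contextlib
import io


class AnswerCommitted(Exception):
    """Hypothetical signal raised when a cell commits the final answer."""

    def __init__(self, value):
        super().__init__(value)
        self.value = value


def ReturnAnswer(value):
    # Assumed behaviour: committing an answer ends the episode.
    raise AnswerCommitted(value)


def run_agent(propose_cell, max_steps=8):
    # One namespace shared by every cell plays the role of the stateful kernel.
    namespace = {"ReturnAnswer": ReturnAnswer}
    transcript = []  # (cell, observation) pairs fed back to the model
    for _ in range(max_steps):
        cell = propose_cell(transcript)
        stdout = io.StringIO()
        try:
            with contextlib.redirect_stdout(stdout):
                exec(cell, namespace)
        except AnswerCommitted as done:
            return done.value, transcript
        except Exception as err:
            # Errors become observations, so the next cell can react to them.
            transcript.append((cell, f"{type(err).__name__}: {err}"))
            continue
        transcript.append((cell, stdout.getvalue()))
    return None, transcript


# Scripted stand-in for the model: three cells that build on shared state.
SCRIPTED_CELLS = [
    "widths_m = [0.42, 0.38, 0.40]\nprint(len(widths_m))",
    "mean_width = sum(widths_m) / len(widths_m)\nprint(round(mean_width, 2))",
    "ReturnAnswer(round(mean_width, 2))",
]


def scripted_model(transcript):
    return SCRIPTED_CELLS[len(transcript)]


answer, transcript = run_agent(scripted_model)
```

The second cell can read `widths_m` only because the namespace persists between steps, which is the property the stateful kernel provides. A real system would run cells in an isolated kernel process instead of calling `exec` in the host interpreter.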
The shift in the action interface is the whole point. Most agents act through a fixed tool-calling schema, a predefined set of functions with structured arguments that the model selects from one at a time. SpatialClaw's argument is that code is a more expressive interface: a single cell can compose several tools together, inspect intermediate evidence like a depth map, a segmentation mask, or a measured distance, and revise the approach before answering, rather than locking into a plan up front. For spatial questions, where the answer usually depends on chaining perception steps and then doing geometry on the results, that flexibility is exactly what a rigid tool menu lacks.
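To make the composition argument concrete, here is a hedged sketch of what one agent-written cell might do: segment two objects, check the intermediate masks, back-project them with depth, and measure the distance between them. Everything in it is an assumption for illustration. The tiny label and depth grids are invented data standing in for SAM3 and Depth-Anything-3 outputs, `segment` and `centroid_3d` are hypothetical helpers, and the camera is a simple pinhole model with its principal point at pixel (0, 0).

```python
import math

# Synthetic 2x4 scene standing in for real model outputs (all values invented).
DEPTH_M = [
    [2.0, 9.0, 9.0, 4.0],
    [2.0, 9.0, 9.0, 4.0],
]
LABELS = [
    [1, 0, 0, 2],
    [1, 0, 0, 2],
]
FOCAL_PX = 2.0  # assumed pinhole focal length, principal point at pixel (0, 0)


def segment(label):
    """Hypothetical segmentation primitive: pixel coordinates carrying a label."""
    return [
        (r, c)
        for r, row in enumerate(LABELS)
        for c, value in enumerate(row)
        if value == label
    ]


def centroid_3d(pixels):
    """Back-project pixels through the pinhole model and average the 3D points."""
    points = [
        (c * DEPTH_M[r][c] / FOCAL_PX, r * DEPTH_M[r][c] / FOCAL_PX, DEPTH_M[r][c])
        for r, c in pixels
    ]
    return tuple(sum(axis) / len(points) for axis in zip(*points))


# One cell composes several steps and inspects intermediate evidence
# (the masks) before committing to a measurement.
mask_a, mask_b = segment(1), segment(2)
if not mask_a or not mask_b:
    raise ValueError("an object was not found; revise the segmentation step")
distance_m = math.dist(centroid_3d(mask_a), centroid_3d(mask_b))
print(f"{distance_m:.2f} m between object centroids")
```

Under a fixed tool-calling schema, each of these steps would be a separate structured call, and the mask check and the geometry would need dedicated tools of their own. In a code cell they are ordinary Python.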
The numbers back the design. Across 20 spatial reasoning benchmarks SpatialClaw reaches 59.9% average accuracy, an improvement of 11.2 percentage points over the prior best spatial agent, and it gets there training-free: there is no fine-tuning, only off-the-shelf perception models orchestrated by a VLM. NVIDIA tested six backbones across two model families, Qwen 3.5/3.6 and Gemma 4, ranging from 26 billion to 397 billion parameters, which suggests the gains are a property of the framework rather than one lucky model. The code is on GitHub under a non-commercial NVIDIA license.
The honest limits are the usual ones for this category. This is a benchmark result, and spatial-reasoning benchmarks are not the messy physical world a robot actually has to move through, so strong scores are a promise rather than proof of reliable behavior on hardware. Training-free also means the ceiling is set by the perception tools the framework wires together, since nothing is learned end to end. But the direction is what makes it worth noting, and it rhymes with where the field has moved all week: code as the universal action interface, the same instinct behind agents that write Python to get things done, and perception assembled from composable primitives instead of one monolithic model. SpatialClaw is a bet that for reasoning about the physical world, the most useful thing to hand an agent is not a bigger menu of tools, but a blank cell and a kernel already full of them.
