Meta FAIR released Sapiens2 this week (paper accepted to ICLR 2026, weights on GitHub at facebookresearch/sapiens2), and the headline feature is that the entire family now runs natively at 1K resolution, with a 1B-parameter hierarchical variant trained at 4096×3072. Most prior human vision foundation models cap out at 256 or 512 because the compute and data costs of going higher are punishing. The Sapiens2 team trained on a curated dataset of 1 billion human images (up from roughly 300 million in Sapiens v1), combining masked image reconstruction with self-distilled contrastive objectives so the same backbone learns both low-level detail and high-level semantics. The model family spans 0.4B to 5B parameters, all with patch size 16, and the base size is trained at 1024×768.
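Those resolution numbers translate directly into token counts, and token count is where the compute bill comes from. A minimal back-of-envelope sketch: the patch-16 figure comes from the release details above, while the attention-cost scaling is generic ViT arithmetic, not a claim about Sapiens2's exact architecture (the hierarchical variant presumably avoids full global attention).

```python
# Back-of-envelope patch-token counts for a ViT with patch size 16. Standard
# ViT arithmetic, not figures from the Sapiens2 paper; global self-attention
# cost grows roughly with the square of the token count, which is why native
# 1K-4K training is expensive and why hierarchical variants exist.

def patch_tokens(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens for an image of the given size."""
    return (height // patch) * (width // patch)

baseline = patch_tokens(256, 256)  # the 256px cap common in prior models
for name, (h, w) in {
    "256x256 (typical prior cap)": (256, 256),
    "1024x768 (Sapiens2 base)": (1024, 768),
    "4096x3072 (1B hierarchical variant)": (4096, 3072),
}.items():
    n = patch_tokens(h, w)
    print(f"{name}: {n} tokens, ~{(n / baseline) ** 2:.0f}x the attention cost of 256px")
```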
The output set is what makes this useful for actual production work, not just paper benchmarks. A single Sapiens2 model produces pose estimation, body-part segmentation, surface normals, a pointmap (a 3D reconstruction primitive), and albedo (intrinsic surface color, decoupled from lighting). The last two are new relative to Sapiens v1, and pointmap plus albedo are exactly the primitives you need for relightable 3D human avatars, which is where the model lineage feeds into Meta's Codec Avatars work. The gains over v1 are not modest: +4 mAP on pose, +24.3 mIoU on body-part segmentation, and 45.6% lower angular error on normal estimation. A 24-point jump in segmentation mIoU is the kind of improvement that obsoletes the previous generation rather than incrementing on it.
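To make the single-pass, multi-head output concrete, here is a hedged sketch of what consuming such a model could look like downstream. The function name, output keys, and shapes are illustrative assumptions, not the actual facebookresearch/sapiens2 API; check the repo for the real entry points.

```python
# Hypothetical sketch of a multi-task human-vision forward pass. Everything
# below is a stand-in: the real Sapiens2 loading and inference API lives in
# the facebookresearch/sapiens2 repo and may look nothing like this.
from typing import Dict
import numpy as np

def run_human_backbone_stub(image: np.ndarray) -> Dict[str, np.ndarray]:
    """Stand-in for one forward pass that returns every task head at once."""
    h, w = image.shape[:2]
    return {
        "pose_keypoints":    np.zeros((17, 3)),                 # (x, y, confidence); 17 joints is an illustrative COCO-style skeleton
        "part_segmentation": np.zeros((h, w), dtype=np.uint8),  # per-pixel body-part label
        "surface_normals":   np.zeros((h, w, 3)),               # unit normal per pixel
        "pointmap":          np.zeros((h, w, 3)),               # per-pixel 3D coordinates
        "albedo":            np.zeros((h, w, 3)),               # lighting-free surface color
    }

outputs = run_human_backbone_stub(np.zeros((1024, 768, 3), dtype=np.uint8))
# One pass replaces what used to be five single-task models:
normals, points = outputs["surface_normals"], outputs["pointmap"]
```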
The strategic read is that Meta is positioning this as the open-weights answer to the proprietary mocap and avatar pipelines that have dominated the AR/VR and visual-effects industries. Most existing human vision stacks at this quality level are built on closed datasets and licensed components (Vicon, Marker.io, the various body-tracking SDKs), while Sapiens2 ships its weights publicly under a permissive license consistent with prior FAIR releases. For a small studio or a research lab that previously needed to license a body-tracking SDK or train a proprietary stack, the calculus has shifted. The model is not magic; it still needs cleanup for production mocap, calibration for specific cameras, and rigging work to drive avatars, but the foundation layer that used to cost real money is now downloadable.
For builders working on human-centric vision (VR/AR, fitness tech, sports analytics, telepresence, photogrammetry, virtual try-on, motion capture pipelines), Sapiens2 is worth a serious evaluation. The 1K and 4K resolution variants are the headline; the multi-task single-model architecture is the practical productivity gain, because you get pose, segmentation, normals, and 3D primitives from one inference pass instead of five. The open-weights release means you can fine-tune on your specific application, body-type distribution, or lighting conditions without going through a vendor licensing cycle. The honest caveats: the 5B-parameter top-end variant is heavy enough to need real GPU infrastructure to serve at video frame rates, and the 1-billion-image training set has its own demographic distribution that affects fairness on edge cases. Meta has not yet published the demographic breakdown, and prior tracking research in the field suggests the long tail is where these models still fail. Run your own evaluation set before deploying.
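On that last point, a minimal sketch of what a per-slice evaluation harness could look like: scoring by condition rather than in aggregate, so long-tail failures show up before deployment. The slice names and the keypoint-error metric are illustrative placeholders, not part of any Sapiens2 tooling.

```python
# Minimal sketch of "run your own evaluation set": score per slice
# (lighting, pose class, occlusion, camera) instead of one aggregate number,
# so long-tail regressions are visible before deployment. Slice names and
# the metric are illustrative placeholders, not Sapiens2 tooling.
from collections import defaultdict
from statistics import mean

def keypoint_error(pred, gt):
    """Mean Euclidean distance between predicted and ground-truth 2D keypoints."""
    return mean(((px - gx) ** 2 + (py - gy) ** 2) ** 0.5
                for (px, py), (gx, gy) in zip(pred, gt))

def evaluate_by_slice(records):
    """records: iterable of dicts with 'slice', 'prediction', 'ground_truth' keys."""
    per_slice = defaultdict(list)
    for r in records:
        per_slice[r["slice"]].append(keypoint_error(r["prediction"], r["ground_truth"]))
    return {name: mean(errors) for name, errors in per_slice.items()}

# Gate deployment on the worst slice (e.g. 'low_light', 'seated', 'occluded'),
# not on the overall mean.
```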
