URV's CoCoGraph generates 7.3M novel molecules with constrained diffusion, Zubnet AI News

Researchers at Spain's Universitat Rovira i Virgili (URV) have published CoCoGraph, a discrete diffusion model that generates chemically valid molecules by learning how real molecules break apart and reassemble. It produced 8.2 million molecules, 7.3 million of which (about 89%) do not appear in PubChem, at 100% chemical validity and roughly 96% novelty measured against its training data. The work is peer-reviewed and published in Nature Machine Intelligence. For anyone working on drug discovery, materials science, or hard chemistry, this is a result in the "AI generates new compounds" headline category that is worth looking at closely.

CoCoGraph is built on constrained discrete diffusion: the model learns molecular structure through reversible "double edge swapping" operations that preserve bonding requirements along the entire diffusion trajectory. Unconstrained molecule generators can produce chemically impossible structures, such as atoms with the wrong valence or broken aromatic rings, and then need post-hoc filtering. CoCoGraph's constraint-during-diffusion design instead keeps every intermediate state a valid molecule. Numbers from the paper: 100% chemical validity, 99.8-99.9% uniqueness, 95.7% novelty against training data, a GuacaMol KL divergence score of 95.7-96.3%, and a 62% pass rate on a human-expert test in which chemists were asked to distinguish generated molecules from real ones (a result close to chance, which here counts in the model's favor: experts could not reliably tell the two apart). The full generation run produced 8.2M molecules, 7.3M of which are not in PubChem. Authors: Roger Guimerà, Manuel Ruiz-Botella, and Marta Sales-Pardo.
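To make "double edge swapping" concrete, here is a minimal sketch of the textbook degree-preserving swap from graph theory, applied to a molecule stored as a dictionary of bond orders. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `max_order` cap, and the data layout are all hypothetical, and the real model may enforce further constraints (for example connectivity) that this summary does not describe.

```python
from collections import Counter


def atom_degrees(bonds):
    """Total bond order at each atom, i.e. how much valence it uses."""
    deg = Counter()
    for pair, order in bonds.items():
        for atom in pair:
            deg[atom] += order
    return dict(deg)


def double_edge_swap(bonds, e1, e2, max_order=3):
    """Move one unit of bond order from a-b and c-d onto a-c and b-d.

    `bonds` maps frozenset({i, j}) -> integer bond order.
    Returns a new dict, or None if the swap is not allowed.
    `max_order` is an assumed cap (no bond above a triple bond).
    """
    a, b = e1
    c, d = e2
    if len({a, b, c, d}) != 4:
        return None  # the two bonds must not share an atom
    old1, old2 = frozenset(e1), frozenset(e2)
    if bonds.get(old1, 0) < 1 or bonds.get(old2, 0) < 1:
        return None  # both bonds must exist
    new1, new2 = frozenset((a, c)), frozenset((b, d))
    if bonds.get(new1, 0) + 1 > max_order or bonds.get(new2, 0) + 1 > max_order:
        return None  # would create an over-saturated bond
    swapped = dict(bonds)
    for key in (old1, old2):
        swapped[key] -= 1
        if swapped[key] == 0:
            del swapped[key]  # bond fully removed
    for key in (new1, new2):
        swapped[key] = swapped.get(key, 0) + 1
    return swapped


# Four-atom chain 0-1-2-3 with single bonds.
chain = {frozenset((0, 1)): 1, frozenset((1, 2)): 1, frozenset((2, 3)): 1}
swapped = double_edge_swap(chain, (0, 1), (2, 3))
# Every atom keeps the same total bond order before and after the swap.
assert atom_degrees(swapped) == atom_degrees(chain)
```

Each atom loses one unit of bond order and gains one, so its total is unchanged. That is why a trajectory built only from such swaps cannot produce an atom with the wrong valence.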

Molecule generation has been an active ML target for over half a decade. Earlier work used graph-based generative models (JT-VAE, MolGAN), and more recent work moved to diffusion (GeoDiff, DiffSBDD). The validity-versus-novelty trade-off has been the open question: it is easier to generate things that look novel if you do not care whether they are chemically real, and easier to generate chemically real molecules if you stay close to the training data. CoCoGraph's constraint-during-diffusion approach achieves both at once, 100% valid and 95.7% novel, which is the combination labs have been chasing. The downstream implication: drug-discovery pipelines that previously gated AI-proposed molecules through expensive post-hoc validity filters can rely on validity being enforced during generation, freeing screening capacity for synthesis-feasibility and target-binding evaluation. Materials science labs working on refrigerants, catalysts, or polymers can apply the same approach to their domains.
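Validity, uniqueness, and novelty figures of the kind quoted in this article are usually simple set ratios. The sketch below follows the common benchmark convention (validity over everything generated, uniqueness over valid outputs, novelty over unique outputs), which is an assumption here; the paper's exact definitions may differ. Molecules are represented as canonical identifier strings, and `is_valid` is a hypothetical stand-in for a real chemistry check.

```python
def generation_metrics(generated, training_set, is_valid):
    """Compute validity, uniqueness, and novelty as fractions in [0, 1].

    generated:    list of canonical molecule strings (may contain repeats)
    training_set: iterable of canonical strings the model was trained on
    is_valid:     callable returning True for a chemically valid string
                  (hypothetical placeholder for a real validity check)
    """
    valid = [m for m in generated if is_valid(m)]
    unique = set(valid)
    novel = unique - set(training_set)
    return {
        "validity": len(valid) / len(generated) if generated else 0.0,
        "uniqueness": len(unique) / len(valid) if valid else 0.0,
        "novelty": len(novel) / len(unique) if unique else 0.0,
    }


# Toy example: five outputs, one invalid, one repeated, one seen in training.
toy = ["CCO", "CCO", "CCN", "bad", "CCC"]
metrics = generation_metrics(toy, {"CCO"}, lambda s: s != "bad")
# validity = 4/5, uniqueness = 3/4, novelty = 2/3
```

The reference set matters when reading the headline numbers: the 95.7% novelty is measured against training data, whereas 7.3M of 8.2M molecules missing from PubChem corresponds to about 89% against a much larger database.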

The paper is published in Nature Machine Intelligence (peer-reviewed, not a preprint). Code and weights availability is not stated in the summary; the DOI is the place to confirm. For drug discovery and materials labs running internal molecule generation, this is worth comparing against your current filtered-generation pipeline. For the broader audience, this is a real example of AI doing chemistry research that is not a press-release demo, with peer review and explicit benchmark methodology behind it. The 7.3M novel molecules are not drug candidates yet; they are a space to search. Still, being able to search a chemically valid 7.3M-molecule space is a measurable acceleration over what manual chemistry can do.
