Poetiq, an unidentified organization with a website at poetiq.ai, published results claiming that a "Meta-System" can automatically construct inference harnesses that improve any LLM's performance on coding benchmarks without fine-tuning or model-internal access. The reported numbers on LiveCodeBench Pro are sharp: Gemini 3.1 Pro climbs from 78.6% to 90.9%, GPT-5.5 High from 89.6% to 93.9%, Kimi K2.6 from 50.0% to 79.9% (roughly +30 percentage points), Gemini 3.0 Flash from 72.3% to 82.3%, and Nemotron 3 Super 120B by +12.8pp. The harness was optimized on Gemini 3.1 Pro only and applied unchanged to the other models. If those numbers replicate, it is a meaningful inference-time gain, especially the Kimi K2.6 result on a competitive-programming-style benchmark.
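The gains are easier to compare when each one is also read as a share of the baseline's failures, since a model starting near 90% has far less room to improve than one starting at 50%. The short Python sketch below does that arithmetic on the reported figures only. The "share of failures removed" framing is an added lens, not something Poetiq reports, and Nemotron 3 Super 120B is left out because only its delta is given, not its baseline.

```python
# Baseline and with-harness scores (percent) as reported in the claim above.
reported = {
    "Gemini 3.1 Pro": (78.6, 90.9),
    "GPT-5.5 High": (89.6, 93.9),
    "Kimi K2.6": (50.0, 79.9),
    "Gemini 3.0 Flash": (72.3, 82.3),
}


def summarize(baseline: float, with_harness: float) -> tuple[float, float]:
    """Return (gain in percentage points, percent of baseline failures removed)."""
    gain_pp = with_harness - baseline
    # Failures at baseline are (100 - baseline); the gain removes part of them.
    failures_removed_pct = 100.0 * gain_pp / (100.0 - baseline)
    return round(gain_pp, 1), round(failures_removed_pct, 1)


summary = {model: summarize(*scores) for model, scores in reported.items()}

for model, (gain_pp, removed_pct) in summary.items():
    print(f"{model}: +{gain_pp}pp, about {removed_pct}% of baseline failures removed")
```

On this reading the GPT-5.5 High gain of 4.3pp is less modest than it looks: it removes roughly 41% of that model's remaining failures, against roughly 60% for Kimi K2.6.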
The mechanism is where the claim gets thin in public form. The blog frames the meta-system as "building task-specific harnesses through recursive self-improvement" by "developing better strategies for determining what to ask, refining sequential chain-of-questions, and devising new methods for assembling answers." That is shape rather than spec. No step-by-step algorithm is published, no arXiv preprint ID is provided, no GitHub repository is named, and the harness itself does not appear to be open source. The article links to a Poetiq post at poetiq.ai/posts/recursive_self_improvement_coding/ for technical details, but the disclosure level there determines whether this is a reproducible result or a vendor claim. The pattern for inference-time-gains research over the past two years has been that the headline numbers usually hold but at lower magnitudes once a third party reproduces with the same harness on a clean run.
LiveCodeBench Pro is the right benchmark choice for this kind of claim because it is designed against the two common failure modes, data contamination and overfitting, through C++ competitive programming tasks and continuous updates. That helps. But harness optimization on LCB Pro can still overfit to LCB Pro: the meta-system was trained to maximize score on this exact eval, even if no individual problem leaked. The Kimi K2.6 jump from 50% to 80% is the kind of swing where you want to ask whether the harness encodes structural knowledge of the benchmark format (input/output shape, sample test runners, retry-on-failure loops) versus genuinely generalizable reasoning support. Without the harness in the open, that question cannot be answered.
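To make "structural knowledge of the benchmark format" concrete, here is a minimal Python sketch of the simplest version of that idea: a retry-on-failure loop that checks each candidate against the problem's sample tests and feeds the failure back to the model. This is not Poetiq's harness, which is unpublished. Every name here (`passes_samples`, `solve_with_retries`, `toy_model`) is hypothetical, and candidates are plain Python callables standing in for compiled C++ submissions.

```python
from typing import Callable, List, Optional, Tuple

Sample = Tuple[str, str]          # (stdin text, expected stdout text)
Program = Callable[[str], str]    # stand-in for a compiled submission


def passes_samples(program: Program, samples: List[Sample]) -> Optional[str]:
    """Return None if every sample passes, else a message describing the first failure."""
    for stdin, expected in samples:
        got = program(stdin)
        if got.strip() != expected.strip():
            return f"input {stdin!r}: expected {expected!r}, got {got!r}"
    return None


def solve_with_retries(
    generate: Callable[[Optional[str]], Program],
    samples: List[Sample],
    max_attempts: int = 3,
) -> Tuple[Optional[Program], int]:
    """Ask the model for a candidate, test it, and retry with the failure as feedback."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)
        feedback = passes_samples(candidate, samples)
        if feedback is None:
            return candidate, attempt
    return None, max_attempts


def toy_model(feedback: Optional[str]) -> Program:
    """Hypothetical model: the first draft has an off-by-one bug, the revision is correct."""
    if feedback is None:
        return lambda s: str(sum(int(x) for x in s.split()) + 1)
    return lambda s: str(sum(int(x) for x in s.split()))


samples = [("1 2", "3"), ("10 5", "15")]
solution, attempts = solve_with_retries(toy_model, samples)
print("solved" if solution else "unsolved", "after", attempts, "attempt(s)")
```

A loop like this lifts scores on any benchmark that ships sample tests, and it does so without making the model reason any better. That is why a large gain alone cannot distinguish format exploitation from general reasoning support.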
For builders: bookmark this and wait. If Poetiq publishes the harness or the meta-system, the +30pp Kimi K2.6 result is worth running on your own coding evals before you change anything. If they publish only a paper without code, treat it as a hypothesis until somebody else replicates. The substantive question ("can prompt and harness engineering at this depth produce ~10-30pp gains across heterogeneous models with no per-model retuning") is one of the higher-value open questions in the agentic coding space right now, and the answer to it is worth more than any single benchmark number.
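If the harness does become available, running it on your own evals comes down to a paired comparison: the same problems, scored once with the bare model and once with the harness. The Python sketch below is one way to do that, assuming per-problem pass/fail lists. It reports the gain in percentage points plus an exact McNemar (sign) test on the problems where the two runs disagree. The function name and the example data are invented for illustration and are not results from any real model.

```python
from math import comb
from typing import List, Tuple


def paired_comparison(baseline: List[bool], harness: List[bool]) -> Tuple[float, int, int, float]:
    """Compare two runs over the same problems.

    Returns (gain in percentage points, problems fixed, problems broken,
    two-sided exact McNemar p-value on the discordant pairs).
    """
    if len(baseline) != len(harness) or not baseline:
        raise ValueError("both runs must cover the same non-empty problem set")
    fixed = sum(1 for b, h in zip(baseline, harness) if not b and h)
    broken = sum(1 for b, h in zip(baseline, harness) if b and not h)
    gain_pp = 100.0 * (fixed - broken) / len(baseline)

    discordant = fixed + broken
    if discordant == 0:
        return gain_pp, fixed, broken, 1.0
    # Under the null, each discordant problem is equally likely to be fixed or broken.
    k = min(fixed, broken)
    tail = sum(comb(discordant, i) for i in range(k + 1)) / 2 ** discordant
    return gain_pp, fixed, broken, min(1.0, 2.0 * tail)


# Made-up results for 20 problems: True means the submission passed.
baseline_run = [True] * 10 + [False] * 10
harness_run = [True] * 9 + [False] + [True] * 6 + [False] * 4

gain_pp, fixed, broken, p_value = paired_comparison(baseline_run, harness_run)
print(f"gain: {gain_pp:+.1f}pp, fixed: {fixed}, broken: {broken}, p = {p_value:.3f}")
```

In this invented example a 25pp gain on 20 problems gives p = 0.125, which is not convincing evidence. Small private evals need large, consistent gains before a harness change is worth acting on.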
