UK AISI: AI cyber time horizon doubling 4.7mo; Mythos/GPT-5.5 saturate benchmark, Zubnet AI News

The UK government's AI Security Institute (AISI) published updated cyber-capability tracking Thursday, with numbers that revise the field's prior trajectory estimate. AISI measures frontier model cyber capability via "time horizon benchmarks": the length of cybersecurity task, measured in human-expert working time, that an AI system can complete autonomously at a given reliability. The February 2026 estimate puts the doubling time of the 80%-reliability cyber time horizon at 4.7 months since reasoning models emerged in late 2024, given a 2.5M token limit per task. The November 2025 estimate had been 8 months for both 50% and 80% reliability, so the estimated doubling time shortened by roughly 40% in three months. Claude Mythos Preview and GPT-5.5 have since significantly outperformed even the revised 4.7-month trend; AISI explicitly flags the open question of whether this is "an isolated break from existing rates of progress or part of a new, faster trend." The honest framing matters: AISI is not declaring a new trend, only documenting that the most recent data is faster than even the recently revised estimate.
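To make the gap between the two estimates concrete, here is a minimal sketch of the growth arithmetic. It assumes the time horizon grows as a clean exponential with a fixed doubling time, which is a simplification; the function name `growth_factor` is illustrative and does not come from AISI's report.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplicative growth over `months`, assuming a constant doubling time."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


# Doubling times quoted in the article, in months.
NOV_2025_ESTIMATE = 8.0
FEB_2026_ESTIMATE = 4.7

if __name__ == "__main__":
    for label, doubling in [("Nov 2025", NOV_2025_ESTIMATE),
                            ("Feb 2026", FEB_2026_ESTIMATE)]:
        # Implied growth of the time horizon over a 12-month window.
        print(f"{label} estimate: x{growth_factor(12, doubling):.1f} per year")
```

Under these assumptions, the 8-month estimate implies roughly 2.8× growth per year, while the 4.7-month estimate implies roughly 5.9×.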

The specific cyber-range results are what make this concrete. Claude Mythos Preview became the first model to complete both of AISI's evaluated ranges. "The Last Ones," a 32-step simulated corporate network attack, was solved in 6 of 10 attempts. "Cooling Tower," a 7-step industrial control system (ICS) attack previously unsolved by any tested model, was solved in 3 of 10 attempts. GPT-5.5 completed "The Last Ones" in 3 of 10 attempts but did not solve Cooling Tower in the reported runs. Both Mythos and GPT-5.5 achieved near-100% success rates on the longest tasks in the limited cyber test suite even with the 2.5M token cap applied. The Cooling Tower result is the most operationally significant data point: until this round, the industrial-controls scenario had resisted every tested frontier model, and a 3/10 success rate from a single model crosses a defensive-planning threshold for any organization running operational technology (OT) systems. AISI's tracking is consistent with that of METR, the nonprofit research group whose AI software-engineering capability metric has doubled roughly every 4.2 months since late 2024.
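With only 10 attempts per range, the reported success rates carry wide uncertainty. The sketch below computes a Wilson score interval for each result. It assumes the attempts are independent trials with a fixed success probability, which the article does not confirm, and `wilson_interval` is an illustrative helper rather than anything AISI publishes.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    # Success counts out of 10 attempts, as reported in the article.
    results = [("The Last Ones, Mythos Preview", 6),
               ("Cooling Tower, Mythos Preview", 3),
               ("The Last Ones, GPT-5.5", 3)]
    for name, solved in results:
        low, high = wilson_interval(solved, 10)
        print(f"{name}: {solved}/10, approx. 95% interval {low:.2f} to {high:.2f}")
```

Under that independence assumption, a 3/10 result is compatible with a true success rate anywhere from about 11% to about 60%, so the point estimates are better read as "non-zero and repeatable" than as precise rates.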

The benchmark-saturation problem is the part to weigh most carefully. AISI explicitly notes: "the latest frontier models are beginning to exceed the limits of the current cyber evaluation framework... once models consistently complete the most difficult tasks, the benchmark becomes harder to measure." Removing the 2.5M token cap would push success rates high enough that time horizon estimates "could no longer be calculated reliably." This is the kind of candid harness disclosure that capability reporting needs: the benchmark is approaching the regime where it no longer differentiates between models, and AISI is saying so. The corollary is that the next round of capability claims from frontier labs will need new evals or risk being meaningless; expect to see Mythos Preview and GPT-5.5 quoted as "100% on the AISI cyber suite" while the underlying differentiation is invisible. Pair this with the VectorSmuggle research from yesterday (a novel attack class on RAG infrastructure) and Microsoft MDASH last week (100+ agents finding Windows RCEs): offensive capability is compounding across multiple measurement frames simultaneously.
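A toy sketch of why saturation breaks a time horizon estimate: if the horizon is the task length at which success falls to 80%, it is undefined once success stays above 80% on every task in the suite. This is not AISI's estimation procedure, which the article does not describe; `horizon_at` is a hypothetical helper, and the bucket numbers are invented purely for illustration.

```python
import math
from typing import Optional, Sequence, Tuple


def horizon_at(threshold: float,
               buckets: Sequence[Tuple[float, float]]) -> Optional[float]:
    """Task length (human-expert minutes) at which success falls to `threshold`.

    `buckets` holds (human_expert_minutes, success_rate), sorted by length.
    Returns None if success never drops below the threshold (saturated suite).
    """
    for (len_a, rate_a), (len_b, rate_b) in zip(buckets, buckets[1:]):
        if rate_a >= threshold > rate_b:
            # Interpolate linearly in log(task length) between the two buckets.
            frac = (rate_a - threshold) / (rate_a - rate_b)
            log_len = math.log(len_a) + frac * (math.log(len_b) - math.log(len_a))
            return math.exp(log_len)
    return None


# Invented illustrative data, NOT AISI results.
unsaturated = [(10, 1.0), (60, 0.9), (240, 0.6), (960, 0.2)]
saturated = [(10, 1.0), (60, 1.0), (240, 0.98), (960, 0.95)]

if __name__ == "__main__":
    print("unsaturated suite:", horizon_at(0.8, unsaturated))
    print("saturated suite:  ", horizon_at(0.8, saturated))
```

In the saturated case the function has nothing to interpolate and returns `None`: the suite can only say the horizon is longer than its hardest task, which is the measurement problem AISI describes.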

For builders and defensive security teams: assume the 4.7-month doubling trajectory holds at minimum through Q3 2026, and treat the Mythos/GPT-5.5 outperformance as evidence that actual progress may run ahead of it. Concrete planning implications: (1) the span a single frontier model can autonomously sustain in multi-step intrusion operations is now measured in dozens of steps, not single-shot exploits, so defensive monitoring built around point-in-time detection will continue losing ground; (2) the industrial-control-systems threshold (Cooling Tower) being crossed by one model means the same threshold will likely be crossed by others within 3-6 months on the current trajectory, so OT/ICS security teams should be running their own internal AISI-style cyber-range evals against the models they expect to face; (3) the AISI cyber-range methodology itself is the part to lift: "did the model solve a 32-step corporate attack scenario" is a more useful eval than CTF aggregate scores for risk modeling. Watch for AISI's next quarterly update; if the 4.7-month doubling holds, the cyber time horizon at year-end is roughly 4× what it is now.
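For planning purposes, the doubling time can be inverted to ask how long a given multiple takes. The sketch below assumes the 4.7-month doubling time stays constant, which is exactly the open question AISI flags; `months_to_multiply` is an illustrative name.

```python
import math

DOUBLING_MONTHS = 4.7  # AISI's February 2026 estimate, as quoted in the article


def months_to_multiply(factor: float, doubling_months: float = DOUBLING_MONTHS) -> float:
    """Months needed for the time horizon to grow by `factor` at a fixed doubling time."""
    if factor <= 0 or doubling_months <= 0:
        raise ValueError("factor and doubling_months must be positive")
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    for factor in (2, 4, 10):
        print(f"x{factor}: about {months_to_multiply(factor):.1f} months")
```

At this rate a 4× increase is two doublings, or about 9.4 months, and a 10× increase takes a little under 16 months.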
