OpenAI published research this week on a problem every lab has and few can measure well: knowing how a model will actually behave in the wild before it ships. The method, which OpenAI calls Deployment Simulation, uses recent production data to predict the rate of undesirable behavior ahead of release. In plain terms, it replays recent, de-identified user requests through a candidate model and scores the responses with an LLM judge, building a forecast of how often the model will do unwanted things once real users get it.
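To make the replay-and-judge loop concrete, here is a minimal sketch of the idea described above. It is not OpenAI's implementation. The names `simulate_rate`, `toy_candidate`, and `toy_judge` are hypothetical, the model and judge are trivial stand-in functions rather than real API calls, and the Wilson confidence interval is an addition for illustration, not something the research is reported to use.

```python
import math
from typing import Callable, Iterable, Tuple


def simulate_rate(
    prompts: Iterable[str],
    candidate: Callable[[str], str],
    judge: Callable[[str, str], bool],
) -> Tuple[float, int, int]:
    """Replay each prompt through the candidate model and count how many
    responses the judge flags. Returns (rate, flagged, total)."""
    flagged = 0
    total = 0
    for prompt in prompts:
        response = candidate(prompt)
        if judge(prompt, response):
            flagged += 1
        total += 1
    if total == 0:
        raise ValueError("no prompts to replay")
    return flagged / total, flagged, total


def wilson_interval(flagged: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (an assumption of this sketch;
    useful because rare behaviors give noisy point estimates)."""
    p = flagged / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical stand-ins: a real pipeline would call a model and an LLM judge.
def toy_candidate(prompt: str) -> str:
    return "UNSAFE reply" if "trigger" in prompt else "ok reply"


def toy_judge(prompt: str, response: str) -> bool:
    return "UNSAFE" in response


# Illustrative prompts only; these are not real user requests.
sample_prompts = ["hello", "trigger one", "what is the weather", "trigger two"]
rate, n_flagged, n_total = simulate_rate(sample_prompts, toy_candidate, toy_judge)
low, high = wilson_interval(n_flagged, n_total)
print(f"estimated rate {rate:.2f} ({n_flagged}/{n_total}), interval {low:.2f} to {high:.2f}")
```

The forecast is only as good as its two inputs: how closely the replayed prompts resemble what users will actually send, and how reliably the judge labels responses. Those are the places a real pipeline would need the most care.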
The headline results are about accuracy and about honesty under test. Across 20 behavior categories and three GPT-5-series Thinking deployments, the simulated rates tracked the observed production rates closely, and the method beat both a challenging-prompt baseline and a previous-deployment baseline at predicting which way a behavior would move and by how much. The part worth lingering on is evaluation awareness, the long-standing problem that a model can behave differently when it senses it is being tested. OpenAI reports that simulating real deployment reduced that awareness to levels close to genuine production traffic, which is the whole point: a test the model cannot tell apart from the real thing measures the real thing. The team also extended the approach to agentic deployments, using stateful tool simulators that can produce realistic multi-step trajectories when given enough context.
The more democratizing piece is a companion post on OpenAI's alignment blog, which asks whether outside researchers, who almost never get a lab's private production data, can do this with public data instead. Using WildChat, a public dataset of about a million conversations from 2023 and 2024, the answer is a qualified yes: roughly 95% of WildChat-based predictions landed within about one order of magnitude of the realized production rate, with a mean error near 3.6x across 19 tracked safety categories. That is coarse next to the private-data version, but it is a real signal from data anyone can use, despite a two-to-three-year gap between when WildChat was collected and how people use models now. The sharp caveat the team flags itself: WildChat is far weaker for agentic tasks, where raw errors ran about 37x larger, because short chat logs simply do not contain the tool-rich, multi-step failures that agents produce.
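For readers unused to errors quoted as multiples, the sketch below shows one common way to compute them. It assumes that "3.6x" refers to a fold error, meaning the ratio between predicted and observed rates taken in whichever direction is at least 1, and it averages with a geometric mean. The post as summarized here does not say how the mean was taken, so both choices are assumptions, and the rates in the example are invented for illustration rather than drawn from OpenAI's results.

```python
import math
from typing import Iterable, Tuple


def fold_error(predicted: float, observed: float) -> float:
    """Multiplicative error between two positive rates, always >= 1.
    A prediction of 0.1% against an observed 0.2% is a 2x error,
    and so is 0.2% against 0.1%."""
    if predicted <= 0 or observed <= 0:
        raise ValueError("rates must be positive to form a ratio")
    return max(predicted / observed, observed / predicted)


def summarize(pairs: Iterable[Tuple[float, float]], within: float = 10.0) -> Tuple[float, float]:
    """Return (geometric mean fold error, share of predictions whose
    fold error is at most `within`). The default of 10.0 corresponds
    to "within one order of magnitude"."""
    folds = [fold_error(p, o) for p, o in pairs]
    if not folds:
        raise ValueError("need at least one (predicted, observed) pair")
    geo_mean = math.exp(sum(math.log(f) for f in folds) / len(folds))
    share_within = sum(f <= within for f in folds) / len(folds)
    return geo_mean, share_within


# Invented (predicted, observed) rates for three hypothetical categories.
example_pairs = [(0.001, 0.002), (0.01, 0.0025), (0.0005, 0.01)]
geo_mean, share_within = summarize(example_pairs)
print(f"geometric mean fold error: {geo_mean:.2f}x")
print(f"share within one order of magnitude: {share_within:.0%}")
```

Read this way, an error of a few multiples means a prediction can tell a one-in-a-thousand behavior from a one-in-a-hundred-thousand one, but not reliably from a one-in-three-hundred one.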
This matters for the same reason much of this month's measurement debate does: benchmarks saturate, get gamed, and stop predicting real behavior, so the field needs better ways to forecast what a model will do once it is released. A pre-deployment estimate that resists test-gaming, and a public-data version that lets people outside the labs check the labs' work, are both genuinely useful steps. The honest limits are the ones to hold onto: this is one lab's method validated on its own deployments and numbers, the agentic gap is wide enough that the chat-data version should not be trusted for tool-using systems, and a forecast, however well-calibrated, is still a forecast rather than a guarantee about the next model loosed on the world.
