Why Traditional Benchmarks Fail Modern AI Models with OpenAI Research Scientist Noam Brown
- 01 Benchmark Grids Are Structurally Broken for Modern AI
- 02 Model Capability Is Now a Function of Inference Budget
- 03 Safety Evaluation Frameworks Are Dangerously Out of Date
- 04 There Is Massive Latent Capability in Already-Released Models That Nobody Has Explored
- 05 Performance Asymptotes Are Task-Dependent
- 06 Benchmark Maxing Via Scaffolding Is a Hidden Form of Misleading Comparison
1. Key Themes
Benchmark Grids Are Structurally Broken for Modern AI
The standard benchmark grid — a single number per model per task — was designed for an era when test-time compute scaling didn't exist. Noam argues it systematically misrepresents model capability because it fails to control for how much compute was spent during inference.
"The benchmark results are being presented in the wrong way. They're not controlling for the amount of test time compute that is being used on that benchmark question. It turned out that 5.5 is just much more efficient with its thinking. If you run it at max settings, 5.4 is thinking for a lot longer... And once you control for the amount of thinking time, actually you can see that 5.5 is a substantial jump over 5.4." — Noam Brown 00:02:28
Model Capability Is Now a Function of Inference Budget
This is a categorical shift from prior generations of models. GPT-3-era capability was essentially fixed; modern models' capability is a continuous function of dollars spent.
"The problem is we're in a world now where the capability of the model is a function of how much money you put into it, basically. If you give it a budget of $10,000, it can do a lot more than what it can do with a budget of $10. Give it a budget of $10 million, you can do even more." — Noam Brown 00:12:45
Safety Evaluation Frameworks Are Dangerously Out of Date
Responsible scaling policies and preparedness frameworks were designed before test-time compute scaling was meaningful. They evaluate a model's capability at an implicit fixed budget that no one has defined, making them increasingly inadequate as inference budgets balloon.
"The preparedness frameworks and responsible scaling policies, they don't really account for the amount of test time compute. They just say, okay, well, what's the capability of the model?... At what budget should you evaluate these models? The policies that exist today don't really address that question." — Noam Brown 00:12:45
There Is Massive Latent Capability in Already-Released Models That Nobody Has Explored
Because the model release cycle is faster than the time needed to fully probe a model, the actual ceiling of current models remains unknown. The Erdős unit distance conjecture disproof is the canonical example.
"Honestly, it did it at a budget that was dirt cheap... We ran it through some problems. And this one, at a pretty low budget, it was like, oh, yeah, I think I have a disproof. And then we were able to verify that, yeah, the disproof is correct." — Noam Brown 00:17:44
"Nobody actually knows what the ceiling of capabilities are for these models because nobody's actually run them for long enough to really tell." — Noam Brown 00:16:04
Performance Asymptotes Are Task-Dependent — Not Universal
Not all tasks benefit from more inference compute. Factual retrieval plateaus quickly, while constraint-satisfaction problems like Sudoku keep rewarding additional thinking time. Real benchmarks sit somewhere on this spectrum, and conflating the two kinds distorts evaluation.
"There are some benchmarks where they clearly improve with more test-time compute, and there's some where they don't... If you ask a person, when was Abraham Lincoln born, and they don't know the date, they could sit there, they could think about it for a week. If they don't have access to Wikipedia or something, they're not going to be able to do better." — Noam Brown 00:22:09
Benchmark Maxing Via Scaffolding Is a Hidden Form of Misleading Comparison
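Running a model several times and keeping the best response, or having a judge pick one, raises benchmark scores. The apparent gain can be misleading, because the scaffolded system spent several times more compute than the single run it is compared against.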
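The arithmetic behind this can be sketched under two idealized assumptions that the source does not state: attempts are independent, and the selector (a judge or a checker) always picks a correct response when one exists. The helper names below are hypothetical.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Probability that at least one of n independent attempts succeeds.

    Assumes independence and a perfect selector; real judges are noisier,
    so this is an upper bound on what best-of-n reports.
    """
    return 1.0 - (1.0 - p_single) ** n


def report(p_single: float, n: int, tokens_per_run: int) -> dict:
    """Put the headline score next to the compute that produced it."""
    return {
        "single_run_score": p_single,
        "best_of_n_score": best_of_n_success(p_single, n),
        "tokens_single": tokens_per_run,
        "tokens_best_of_n": n * tokens_per_run,
    }


# Illustrative numbers only, not measurements of any real model.
example = report(p_single=0.5, n=5, tokens_per_run=20_000)
```

For a model that solves a task 50% of the time, best-of-5 reports 0.96875 under these assumptions while spending five times the tokens, so the fair comparison is against a single run given that same larger budget.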
"It's really easy to make something that looks a lot better on paper... if you say, okay, well, we're going to instead of just running this model once, we're going to run it five times and take the best of the five responses or like ask a judge which one it thinks is best. Then you can get much higher scores than that model." — Noam Brown 00:07:02
The Model Release Cycle Outpaces Evaluation Timelines, Creating a Structural Blind Spot
New models arrive every two to three months, but fully probing a model's long-horizon agentic capabilities may require running it for weeks or months. The industry never catches up.
"The model release cycle is, look, we're releasing new models, like, every two or three months at this point. And so a model comes out, it takes two or three months to push it to its limits, and then you have another model come out. And so nobody actually knows what the ceiling of capabilities are for these models." — Noam Brown 00:16:04
AI Research Taste Remains a Key Bottleneck for Full RSI
The models are exceptional at optimization and code acceleration but fail at generating genuinely novel algorithmic ideas. This is the binding constraint on recursive self-improvement right now.
"Can you come up with an algorithm that is better than the algorithms that I came up with or that anybody else came up with? Go ahead and, like, look at all the published work and synthesize that and then try to come up with something novel. And it's not able to do it." — Noam Brown 00:24:29
No Overnight Intelligence Explosion — Time Is the Bottleneck
Noam explicitly rejects hard-takeoff scenarios. Because peak intelligence requires large-scale test-time compute, time itself becomes a physical bottleneck on the rate of capability gain.
"If it requires so much test-time compute to unlock the full capabilities of the model, then that means you're bottlenecked by time. Things can only go so fast because the models need to run for long enough to actually do something really, really powerful. Time itself becomes a bottleneck to what we can do." — Noam Brown 00:26:23
2. Contrarian Perspectives
The Industry Is Trapped in a Bad Benchmark Equilibrium Everyone Privately Knows Is Wrong
Most researchers know the grid is misleading but publish it anyway because everyone else does. This is a coordination failure, not ignorance.
"Everybody would say like, yeah, that makes sense. We should do that. But... people expect us to publish the grid... Because everybody publishes the grid. And so you kind of end up in this bad equilibrium where everybody kind of knows that it's a bad equilibrium, but like nobody wants to break out." — Noam Brown 00:32:25
Current Models May Already Be Capable of Solving Major Open Scientific Problems — For $100K or Less
The conventional assumption is that transformative scientific breakthroughs require next-generation models. Noam argues the current generation, at sufficient compute budget, may already be there, and almost no one has tested this.
"You could, in principle, ask 5.5 to, you know, as a general purpose scaffold, list a bunch of different strategies... it would probably be able to arrive at the disproof with a general purpose scaffold... it would probably cost, I just ballpark, like, $1,000 to $100,000. But it would be possible, and it would have been possible for somebody to disprove the Erdős unit distance conjecture before we did using a general purpose model." — Noam Brown 00:18:35
Routing Layers and Orchestration Startups May Not Deliver Real Value Once Compute Is Controlled For
The hot category of inference routing and multi-model orchestration may offer an illusory improvement: the gain could disappear once you normalize for the total compute budget spent.
"I definitely believe that if you do consensus on the models, they're going to achieve better performance than any individual model. But it's important to ask, like, are you going to do better than having that model basically think for longer? Like once you control for the amount of test time compute, is it actually still doing better?" — Noam Brown 00:34:46
AI Models Are Now Trustworthy Enough for High-Stakes Personal Decisions — More So Than Humans in Some Cases
Against the prevailing caution narrative, Noam argues models have crossed a threshold where they can be trusted on consequential real-world decisions.
"I asked them tax advice or I bought a condo recently and I was asking it for advice... It's actually really good for these kinds of questions... I feel like I can just trust the outputs, arguably more than I could trust the output from a human person." — Noam Brown 00:31:03
There Will Be a Gradual, Not Sudden, Takeoff — And Researchers Grinding Hours Is the Current Binding Constraint
The dominant narrative around frontier competition is about algorithms and compute. Noam points to human researcher hours as the true rate-limiter right now.
"That's why all the researchers are working so intensely right now. It's just so many hours per week are being put into this because we all see what the overhang is. We see what the capabilities are. And we're just bottlenecked by how quickly can we do things." — Noam Brown 00:26:50
3. Companies Identified
OpenAI
Leading AI research lab and developer of the GPT and o-series models. Mentioned as the institution where the Erdős conjecture disproof was achieved, and as the context for all model capability discussions. Noam notes that internally they encourage researchers not to spend all their time working through open problems, in order to stay focused on building more capable models.
"We are trying to encourage people to not spend all their time just, like, going through all the mathematical open problems, physics problems, and just pushing the models to their limits to see what they can prove or disprove. Because we really think the focus should be on how do we make even more capable models?" — Noam Brown 00:20:27
AISI (AI Security Institute, formerly the AI Safety Institute)
UK government AI safety evaluation body. Mentioned as one of the few evaluators actually running models at very high inference budgets (100 million tokens) and observing continued capability improvement.
"The AISI in their evaluations has shown that the models continue to improve at 100 million tokens. You know, if you run them for 100 million tokens, they're still improving at beyond that point." — Noam Brown 00:04:40
4. People Identified
Noam Brown
OpenAI Research Scientist and pioneer of inference-time compute scaling. Co-creator of Libratus and Pluribus (superhuman poker AIs). His essay on large-scale test-time compute evaluation triggered a public conversation about broken benchmarking norms. He personally uses PokerBot construction as his private eval for model capability across releases, and was involved in the Erdős unit distance conjecture disproof.
"With 5.5, I actually thought it was way better. It was able to basically do it zero shot... I've been working on just doing a full scale poker solver. And it's basically able to do the whole thing with some gentle steering from me. And I wouldn't be surprised if, you know, six months or a year from now, the model is able to do zero shot an entire poker solver, basically my entire PhD thesis in one go." — Noam Brown 00:11:06
5. Operating Insights
Use Test-Time Compute as an Explicit Budget Variable, Not an Afterthought
When evaluating AI models for your own business use cases, always fix the inference budget (in tokens, cost, or time) and compare models at equivalent spend. This applies equally to vendor selection, internal evals, and any benchmarks you commission. The model that looks worse on a grid may be dramatically better per dollar.
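One way to put this into practice is to record each model's score at several token budgets and compare models at the same budget rather than at their maximum settings. The sketch below does that with linear interpolation. The function names, the curve format, and all numbers are illustrative assumptions, not real benchmark data or any vendor's API.

```python
from bisect import bisect_right


def score_at_budget(curve, budget):
    """Linearly interpolate a (tokens, score) curve at a given token budget.

    `curve` is a list of (tokens, score) pairs sorted by tokens.
    Budgets outside the measured range are clamped to the nearest
    endpoint rather than extrapolated.
    """
    tokens = [t for t, _ in curve]
    scores = [s for _, s in curve]
    if budget <= tokens[0]:
        return scores[0]
    if budget >= tokens[-1]:
        return scores[-1]
    i = bisect_right(tokens, budget)
    t0, t1 = tokens[i - 1], tokens[i]
    s0, s1 = scores[i - 1], scores[i]
    return s0 + (s1 - s0) * (budget - t0) / (t1 - t0)


def compare_at_budget(curves, budget):
    """Score every model at the same token budget."""
    return {name: score_at_budget(c, budget) for name, c in curves.items()}


# Made-up curves for illustration only.
curves = {
    "model_a": [(1_000, 0.40), (10_000, 0.55), (100_000, 0.70)],
    "model_b": [(1_000, 0.50), (10_000, 0.65), (20_000, 0.68)],
}
at_10k = compare_at_budget(curves, 10_000)
```

With these made-up curves, `model_a` wins at maximum settings (0.70 vs 0.68), but `model_b` is ahead at a shared budget of 10,000 tokens (0.65 vs 0.55), which is the kind of reversal a single-number grid hides. Clamping is a deliberately conservative choice here; projecting beyond the measured range is a separate problem.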
"The proper way to evaluate the models now is you either have some kind of budget for the benchmark, whether it's tokens or cost or time or whatever, or you plot the performance as a function of the amount of test time compute that's going into the model. And then it becomes much more clear how to compare the performance between these different models." — Noam Brown 00:04:00
Build a Personal Private Eval Suite for Every Model Release
Don't rely on published benchmarks to decide which model to use. Design your own domain-specific test, ideally a task that requires real reasoning, has limited publicly available training data, and has known failure modes you can diagnose. Noam's PokerBot construction is a good model for this approach.
"For me lately, it's been I use them to make PokerBots and see how good they can make a PokerBot. I think it's a nice eval because there is very little open source code for making PokerBots... it requires a lot of just reasoning and iteration and like a lot of small gotchas that I can kind of I've already worked through myself." — Noam Brown 00:08:30
Verify Outputs Relentlessly — Model Gaslighting Was Real and Lingers
Earlier models would confidently assert wrong answers and rationalize them. While 5.5 is substantially better, the operational habit of checking outputs — especially on quantitative or logic-dependent work — should remain standard practice.
"The downsides with 5.2 is I felt like it was gaslighting me a lot. And I always had to be very careful checking it and making sure like, OK, is it actually doing what it said it did?... I told it, OK, well, let's say I have $100 in the pot and I fold, how much am I losing? And the model said $92... it said, oh, you know, it's 92. It's close to 100. It's fine. It's no big deal." — Noam Brown 00:10:08
6. Overlooked Insights
Extrapolating Safety Capability From Low-Budget Evals May Be Mathematically Tractable — and Nobody Is Doing It
Noam briefly floats an idea that could address the core safety evaluation problem: fit a performance curve from cheap inference runs and extrapolate it to much higher budgets. This is a nascent research area with large policy implications, and he suggests it would make a good paper for any academic looking for a topic.
"You could probably do some kind of evaluation up to a certain budget and then just say, okay, well, this is what we project the performance to look like... Can you predict what the performance looks like at an inference budget of, let's say, $10,000 only using inference budgets up to $10 or $100? I actually think this would be a great paper to publish if there's any academics out there looking for something to research." — Noam Brown 00:05:08
This is significant because it implies a concrete research program — essentially scaling laws for test-time compute on dangerous capability tasks — that could unlock the next generation of responsible scaling policies. Any lab, academic group, or safety-focused investor funding evals research should treat this as a priority gap.
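A minimal sketch of such a projection, under a strong assumption the source does not make: that score grows linearly in the logarithm of budget over the range of interest. Real curves may saturate or bend, so this only illustrates the shape of the exercise. The function names and data points are invented for illustration.

```python
import math


def fit_log_linear(budgets, scores):
    """Least-squares fit of score = intercept + slope * log10(budget)."""
    xs = [math.log10(b) for b in budgets]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, budget):
    """Projected score at a budget, clipped to the valid range [0, 1]."""
    raw = intercept + slope * math.log10(budget)
    return min(1.0, max(0.0, raw))


# Hypothetical low-budget measurements (budget in dollars, score in [0, 1]).
cheap_budgets = [1, 10, 100]
cheap_scores = [0.20, 0.30, 0.40]

slope, intercept = fit_log_linear(cheap_budgets, cheap_scores)
projected = predict(slope, intercept, 10_000)
```

The useful experiment is the validation step this sketch leaves out: run the model at the high budget as well and check how far the projection was off, per task.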
The Erdős Result Implies a General-Purpose Scientific Discovery Framework Already Exists — Undeployed
Noam mentions almost in passing that the Erdős disproof could have been reached without a specialized system: a generic scaffold (list strategies, then investigate each) applied to a general-purpose model would probably have sufficed, at a ballpark cost of $1,000 to $100,000. If that holds, the tooling for attacking open problems at the level of frontier mathematics, and potentially in physics and chemistry, is already available. Any sufficiently motivated researcher or company could be running this today, and almost nobody is.
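The generic scaffold described here (list strategies, then investigate each) can be sketched as a short loop. `ask_model` and `verify` are hypothetical callables supplied by the caller; nothing below reflects the actual setup used for the Erdős result or any real API.

```python
from typing import Callable, Optional


def run_scaffold(
    problem: str,
    ask_model: Callable[[str], str],   # hypothetical: prompt in, text out
    verify: Callable[[str], bool],     # hypothetical: checks a candidate result
    max_strategies: int = 5,
) -> Optional[str]:
    """List candidate strategies, investigate each, return the first verified result."""
    listing = ask_model(
        f"List up to {max_strategies} distinct strategies, one per line, for: {problem}"
    )
    strategies = [s.strip() for s in listing.splitlines() if s.strip()]
    for strategy in strategies[:max_strategies]:
        attempt = ask_model(
            f"Problem: {problem}\n"
            f"Investigate this strategy in depth and report a result: {strategy}"
        )
        # Independent verification matters: the model's own claim is not trusted.
        if verify(attempt):
            return attempt
    return None
```

In practice the cost is dominated by how long each investigation is allowed to run, so the per-strategy budget is the main parameter a real version would need to expose.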
"You could, in principle, ask 5.5 to, you know, as a general purpose scaffold, list a bunch of different strategies. And then for each strategy, tell it to investigate that strategy. And then it would probably be able to arrive at the disproof with a general purpose scaffold... And nobody had explored sufficiently of what happens if I put $100,000 worth of compute into 5.5. What could it do?" — Noam Brown 00:18:35