This is a Plain English Papers summary of a research paper called First Proof. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.
The benchmark trap
We live in an era of benchmark success. AI systems score impressively on math competitions, standardized tests, and published problem sets. The numbers look good. But there's a hollow spot at the center of this progress, one that nobody talks about until you sit down and think carefully about what these benchmarks actually measure.
When a language model is trained on internet-scale data, it absorbs millions of competition problems, textbook exercises, and their solutions. It has seen the IMO problems of decades past. It has absorbed countless homework solutions. When such a system then scores well on a public benchmark, the honest question becomes: is it reasoning, or is it retrieving? Pattern-matching is not mathematical thinking. It's mathematical recitation.
This distinction matters more than most discussions acknowledge. The gap between solving a published problem and advancing mathematics is exactly where AI's true limitations appear. Published problems are static artifacts. Research problems emerge in real time, shaped by the specific tensions and dead-ends that researchers encounter. They live in the space between known and unknown, and nobody has published the solution yet because the problem didn't exist publicly until the researcher's pen hit paper.
The authors of this paper recognized this crack in how we evaluate AI. Rather than construct another benchmark from published sources, they chose a different approach: take ten math questions that emerged naturally from their own research, encrypt the answers, and see what AI systems can do when they can't possibly have memorized the solution.
This is the core methodology. Not an algorithm, not a new mathematical result, but a way of asking the question that prevents the question itself from being gamed or contaminated by data leakage. The insight is simple, which is what makes it elegant.
The source
The credibility of any benchmark rests entirely on where the questions come from. If the problems are cherry-picked to be easy, or if they're artificially constructed to be maximally hard, the results tell you nothing about actual AI capabilities. They tell you about the benchmark designer's intuitions.
These ten problems come from somewhere different. They're not olympiad problems crafted over months for elegance and balance. They're not competition problems designed by committees to separate students at particular skill levels. They're the mathematical equivalent of a researcher's notebook, questions that emerged organically during active work.
One problem might arise when trying to prove a lemma and hitting an unexpected wall. Another might come from exploring a direction that seemed promising but turned into a dead-end, yet left an interesting open question in its wake. A third might emerge from noticing a pattern that doesn't fit existing theory. The messiness is the point. Real research mathematics feels messy because problems don't announce themselves neatly. They accrete gradually, shaped by the specific context of the researcher's work.
By tying each problem to the authors' actual research process rather than to a curated collection, the benchmark gains something precious: authenticity. These aren't problems designed to showcase either AI's strengths or its weaknesses. They're problems that mattered to the people asking them. That distinction creates a kind of natural representativeness that no amount of careful curation could achieve.
The encryption
Once you announce that these problems exist and that you know the answers, a timer starts. The moment those answers become public, the benchmark begins to deteriorate. Sophisticated actors could use the problems as targets for fine-tuning. More importantly, future large language models trained on internet data will absorb the published problems and solutions naturally, as part of their training corpus. Within a few years, any AI system tested on these problems might not be reasoning through them, but retrieving memorized answers.
The solution is encryption. The answers are locked away, inaccessible without keys that the authors control. This prevents premature leakage into training data during the evaluation phase. It keeps the benchmark itself from becoming contaminated by the process of testing it.
This might sound like basic operational security, and it is. But it serves a deeper methodological purpose. In the age of large-scale AI training, benchmarks face a unique vulnerability: they get absorbed into the very systems they're meant to evaluate. Encryption doesn't eliminate this problem entirely, but it buys time. It ensures that when different research teams eventually test their own systems against these problems, they're not all inheriting contamination from the first public evaluation.
The mechanism is simple, which makes it repeatable. Other researchers can use this same approach, building a library of encrypted benchmarks that serve as stable reference points for measuring progress over years.
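The paper's summary here doesn't spell out the exact cryptographic setup, so treat the sketch below as one plausible way to implement the workflow rather than the authors' actual code. It assumes symmetric encryption via the third-party cryptography package, with made-up file names: the maintainer encrypts the answer file with a key they keep private, publishes only the ciphertext next to the problems, and releases the key when it's time to check solutions.

```python
# Minimal sketch of an "encrypted answers" workflow (an assumed design,
# not the authors' actual implementation). Requires: pip install cryptography
from cryptography.fernet import Fernet


def encrypt_answers(plaintext_path: str, ciphertext_path: str) -> bytes:
    """Encrypt the answer file; return the secret key the maintainer keeps."""
    key = Fernet.generate_key()               # symmetric key, held privately
    with open(plaintext_path, "rb") as f:
        token = Fernet(key).encrypt(f.read())
    with open(ciphertext_path, "wb") as f:
        f.write(token)                         # only this ciphertext is published
    return key


def decrypt_answers(ciphertext_path: str, key: bytes) -> str:
    """Recover the answers once the key is released for verification."""
    with open(ciphertext_path, "rb") as f:
        return Fernet(key).decrypt(f.read()).decode("utf-8")


if __name__ == "__main__":
    # "answers.txt" and "answers.enc" are hypothetical names for illustration.
    secret_key = encrypt_answers("answers.txt", "answers.enc")
    print(decrypt_answers("answers.enc", secret_key))
```

Publishing a hash of the ciphertext or of the key at the same time would additionally let readers verify later that nothing was swapped out before the reveal, though that extra step is a natural extension rather than something the paper is confirmed to do.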
What the patterns will reveal
When results eventually emerge, the temptation will be to reduce them to a single number: AI systems solved seven out of ten, or three out of ten, or whatever the outcome is. Resist that impulse. The useful information lives in the pattern of what gets solved and what doesn't.
Mathematical reasoning isn't monolithic. Some problems hinge on careful symbolic manipulation, the kind of step-by-step algebraic reasoning that AI systems often handle well. Others require a conceptual insight, a sudden recombination of ideas that leads somewhere new. Still others depend on resisting intuition, recognizing that the obvious approach is wrong and finding the less obvious one. A fourth category might require understanding subtle definitions or exploring edge cases systematically.
If an AI system solves all the computational problems but fails on the conceptual ones, that tells a story: the system is good at executing procedures but struggles with novel reasoning. If the opposite pattern emerges, that's a different finding entirely. If the system fails across the board, that's another story still. The specific geography of success and failure reveals the contours of the system's mathematical thinking.
The same applies to failed attempts. A system that reasons clearly all the way to a dead-end reveals something different from a system that goes off the rails immediately. The former shows some mathematical competence applied to a genuinely hard problem. The latter suggests the system is floundering at the conceptual level.
This is why the problems matter more for their structure than for their specific content. By sampling across the landscape of research mathematics, they create a distributed probe of AI capability. You're not asking a single question. You're asking ten different questions in ten different contexts, and the pattern of answers builds a richer picture than any single benchmark could.
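As a concrete illustration of reading that geography, the toy sketch below tags each problem with the kind of reasoning it mainly demands and tallies a hypothetical system's results per category. The categories, problem IDs, and outcomes are all invented for illustration; nothing here comes from the paper itself.

```python
# Illustrative only: hypothetical categories and outcomes, not data from the paper.
from collections import defaultdict

# Each problem tagged with the kind of reasoning it mainly demands.
problems = {
    "P1": "symbolic", "P2": "symbolic", "P3": "conceptual",
    "P4": "conceptual", "P5": "counterintuitive", "P6": "edge-cases",
    "P7": "symbolic", "P8": "conceptual", "P9": "edge-cases",
    "P10": "counterintuitive",
}

# Hypothetical results for one AI system.
solved = {"P1", "P2", "P7", "P9"}

tally = defaultdict(lambda: [0, 0])  # category -> [solved, total]
for pid, category in problems.items():
    tally[category][1] += 1
    if pid in solved:
        tally[category][0] += 1

for category, (ok, total) in sorted(tally.items()):
    print(f"{category:>16}: {ok}/{total} solved")
```

A profile like "3/3 symbolic, 0/3 conceptual" tells a very different story from a flat "4/10," which is exactly the point of looking at the pattern rather than the headline number.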
The role of community
The paper doesn't position itself as a definitive statement. It's an opening move in a longer conversation about how to honestly evaluate AI progress in mathematics. By sharing the problems and encrypting the answers, the authors invite the research community to run their own tests, using the same benchmark across different AI systems and different research teams.
This is where the real value accrues. One team's evaluation of GPT-5 against these problems tells you something. Multiple teams testing multiple systems against the same problems tells you much more. The pattern of results emerges from the aggregate, not from any single data point.
This reflects good science: the research contribution is primarily methodological. The benchmark itself matters more than any particular evaluation run. By publishing the problems while keeping the answers protected, the authors enable reproducibility while preventing their own results from being the final word.
The encrypted answers serve another function here. They prevent the benchmark from being gradually absorbed into training data as a side effect of wide circulation. Years from now, when different researchers test new systems, they can still use these same ten problems with some confidence that the answers haven't already leaked into the models they're testing. The benchmark remains a stable reference point.
This echoes the broader challenge in AI evaluation: how do you measure progress honestly when the things you measure become part of what you train on? Encryption isn't a complete solution, but it's a practical one that other researchers can adopt and adapt.
The deeper question
Underneath this entire project is a recognition that AI evaluation has a credibility problem. When benchmarks are public and static, when they're built from published historical artifacts, when training data scales to encompass most of the internet, the traditional relationship between test and training breaks down. You can't evaluate whether an AI system can solve novel problems by testing it on published problems, because the published problems are no longer novel. They're part of the training data.
This paper offers one clean way forward: create benchmarks that emerge from ongoing research, encrypt the answers to prevent data leakage, and let them serve as stable reference points as the field evolves. It's a small move methodologically, but it addresses a real gap in how we currently think about measuring AI progress.
The value isn't in the specific ten problems or in whoever tests them first. The value is in having a shared instrument for asking the question year after year, watching how the answers change, maintaining the integrity of the signal even as everything else around it shifts. That's how you build honest measures of progress in a field moving as fast as this one.
