Lost Proofs, Found Proofs: How AI Unearthed Forgotten Solutions

Written by aimodels44 | Published 2026/02/12
Tech Story Tags: ai | ai-in-mathematics | gemini-mathematics-discovery | erdos-problems | open-problems-database | bloom's-erdos-problems | ai-hallucinations | verification-bottleneck

TL;DR: AI can search math literature at scale, but this study shows the hardest part is verifying results and proving true novelty, not generating candidates.

This is a Plain English Papers summary of a research paper called Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.

The problem with lost mathematics

Mathematicians have accumulated hundreds of fascinating open problems over decades. Some of these problems sit unsolved because they're genuinely hard. Others, it turns out, aren't actually unsolved at all. They're just forgotten.

This distinction matters more than it might seem. When a problem is marked "open" in an authoritative database like Bloom's Erdős Problems, researchers treat it as a frontier worth exploring. They allocate attention and effort accordingly. But what if the problem was actually solved years ago, just in a paper that never connected with the original problem source? What if the solution exists somewhere in the sprawling literature but never reached the attention of whoever maintains the database?

This is a problem of visibility rather than difficulty. And it's a problem that AI might help solve.

A case study examining 700 problems marked "open" in Bloom's database found something striking: at least 8 of 13 problems that received meaningful resolution weren't genuinely open at all. They had already been solved. The bottleneck wasn't intellectual firepower. It was attention.

Building the hybrid pipeline

Rather than asking AI to solve problems autonomously, the researchers designed something more modest and more realistic: a two-stage screening process that plays to the strengths of both machines and humans.

The first stage uses Gemini, Google's language model, to examine each conjecture. The system attempts to verify or refute problems, find related literature, and generate structured summaries of status. This stage is fast. It can process hundreds of problems in parallel, exploring connections that might take humans weeks to compile.
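To make that output concrete, here is a minimal sketch of what such a structured status record could look like, written in Python. The schema is an assumption for illustration only; the paper does not specify these field names or categories.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemStatus(Enum):
    # Illustrative outcome categories for the screening pass.
    LIKELY_OPEN = "likely_open"                    # nothing found; stays open
    SOLVED_IN_LITERATURE = "solved_in_literature"  # an existing paper appears to settle it
    CANDIDATE_PROOF = "candidate_proof"            # the model proposes its own argument
    NEEDS_REVIEW = "needs_review"                  # ambiguous; flag for a human


@dataclass
class ScreeningRecord:
    """Structured summary produced for one conjecture in the first stage."""
    problem_id: str                          # e.g. the identifier in Bloom's database
    status: ProblemStatus
    candidate_references: list[str] = field(default_factory=list)  # papers the model surfaced
    proposed_argument: str | None = None     # sketch of a claimed proof or refutation, if any
    notes: str = ""                          # free-text justification for the human reviewer
```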

But speed creates a tradeoff. Language models generate plausible-sounding text whether or not it's correct. They find papers that seem related without always understanding the connections deeply. So the second stage matters more: human mathematicians carefully examine any problem where the AI suggested a potential resolution or found relevant literature. This is where credibility gets established.

The combination works because it recognizes a fundamental asymmetry. AI is good at parallel exploration and retrieval at scale. Humans are good at rigorous verification and creative judgment. Neither alone solves the discovery problem well. Together, they amplify each other's capabilities while compensating for each other's weaknesses.
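A rough sketch of how that two-stage split might be wired up, again purely illustrative: the `screen_with_model` stub stands in for whatever call the researchers actually make to Gemini, and the status strings are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def screen_with_model(problem: dict) -> dict:
    # Hypothetical stage-1 call. A real implementation would send the conjecture
    # to the language model, ask it to verify or refute the statement and search
    # for related literature, then parse the reply into a structured record.
    # Here we just return a placeholder "likely open" record.
    return {"problem_id": problem["id"], "status": "likely_open", "references": []}


def run_pipeline(problems: list[dict]) -> list[dict]:
    # Stage 1: fast, parallel AI screening over the whole problem list.
    with ThreadPoolExecutor(max_workers=16) as pool:
        records = list(pool.map(screen_with_model, problems))

    # Stage 2: only problems where the model claims a resolution or cites
    # specific literature get escalated; human experts do the real verification.
    return [
        r for r in records
        if r["status"] in {"solved_in_literature", "candidate_proof"}
    ]
```

The design choice that matters is the filter at the end: the expensive human stage only ever sees problems where the model claims something checkable.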

What the search actually uncovered

Out of 700 problems evaluated, 13 received meaningful resolution. Of these, 5 appeared to be solved through novel autonomous solutions generated by Gemini. The other 8 were already solved, but the solutions existed in the literature without the database knowing about them.

This distribution tells a story. It means the "open" status of these problems stemmed from obscurity rather than difficulty. Someone had published a solution; it just never made its way back to the problem's entry in the official database.

Figure: Erdős problems evaluation pipeline. The 700 problems flow through AI verification to human expert assessment, with tracking of resolution types and verification results.

The finding reframes what "open" actually means. It's not a ranking by intellectual hardness. It's a ranking by visibility and knowledge organization. This has important implications for how researchers choose what to work on and how we think about the frontier of mathematical knowledge.

The credibility problem

Here's where the story gets complicated. The discoveries are real, but they expose something troubling about AI-assisted mathematics: it has failure modes that don't exist in traditional mathematical work.

When Gemini was asked to find relevant existing literature, it identified papers that seemed connected to various conjectures. But human verification revealed something uncomfortable. Some connections were superficial. Some were wrong in subtle ways. A paper might address a related problem in a completely different context, creating false confidence that a solution had been found.

This wouldn't be a minor issue at scale. When you're processing 700 problems, small error rates compound. An AI that's 95% accurate at literature retrieval might still produce dozens of false positives that waste expert time.
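A back-of-the-envelope calculation shows why, assuming (purely for illustration) a flat 5% error rate in literature retrieval across all 700 problems:

```python
problems_screened = 700
retrieval_error_rate = 0.05  # illustrative: a model that is "95% accurate"

expected_false_positives = problems_screened * retrieval_error_rate
print(round(expected_false_positives))  # 35 spurious leads for experts to rule out
```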

The second problem cuts deeper: subconscious plagiarism. When Gemini generated "novel solutions" to problems, some of them weren't genuinely new work. Instead, they recombined patterns from existing work in ways that seemed original without directly copying anything.

This is harder to explain but worth understanding carefully. A language model is trained on millions of papers and texts. When it generates a solution, it's operating in a space shaped by all that training data. A novel-sounding solution might actually be a recombination of existing patterns the model absorbed during training. It's not plagiarism in the intentional sense, but it's not quite genuine discovery either.

Why does this matter? Because mathematical credibility depends on genuine novelty. If an AI's solution is actually a new arrangement of existing ideas scattered across its training corpus, then the verification problem becomes even harder. You'd need to audit not just whether the solution is correct, but whether it's authentically new.

The verification problem itself turned out to be the hardest part of the pipeline. This inverts what many expect from AI. We assume machines should excel at checking and validation. But mathematical verification often requires deep contextual understanding and creative insight. It's harder to scale than generating candidates.

What this reveals about mathematics and machines

These findings point toward a clearer picture of how AI and mathematics actually work together, and it's both more limited and more useful than either pure optimists or skeptics might expect.

First, mathematics is not purely about solving problems. It's also about communication and visibility. The discovery that many "unsolved" problems were already solved suggests our field's knowledge is more fragmented than we'd like to admit. A solution that exists but isn't known is, in practice, little better than no solution at all. This means AI's role in mathematics includes something mundane but essential: helping us organize and retrieve what we already know.

Second, the hybrid model reveals something important about different types of cognitive work. AI excels at exploring large spaces of possibilities in parallel. It's good at retrieving related papers and generating candidate solutions. But the gatekeeper role, the moment where mathematical work becomes credible and trustworthy, still requires human expertise. The sustainable model isn't AI as independent discoverer. It's AI as a tool that expands the problem space humans can carefully examine.

Third, the paper's honesty about failure modes matters. Instead of claiming the approach solved everything, the researchers openly discussed where AI falters: hallucinations in literature identification, the risk of subconscious plagiarism, the difficulty of rigorous verification at scale. This transparency is more useful than any success claim because it tells future researchers where the real challenges lie.

The work connects to a broader exploration of how AI and human expertise combine in mathematical discovery. Other researchers have investigated interactive theorem proving with large language models and AI as a partner in mathematical advancement, finding similar patterns: AI can expand the space of possibilities, but human judgment remains essential for verification and directing effort.

Moving forward

The real insight from this case study isn't that AI discovered lost mathematical progress, though it did help identify some. The insight is that our field's knowledge has always been somewhat fragmented, and tools that help us systematically search, organize, and retrieve what we already know might be as valuable as tools that generate genuinely novel ideas.

Mathematics is not a competition between human and machine intelligence. It's a collaborative space where different strengths matter: machine speed and parallel exploration in generating candidates, human creativity and rigorous judgment in verification and direction.

The papers that best integrate AI and human expertise will recognize this boundary clearly. AI expands what humans can explore. Humans provide the mathematical credibility that comes only from careful, context-aware reasoning. Building tools that respect this division, rather than trying to automate one side completely, is likely where the actual progress lies.

