The gap between AI hype and reality
There's a familiar pattern to how we talk about AI breakthroughs. An agent solves a coding challenge. An AI beats a benchmark. Researchers publish results. The demos look impressive. But then you try to use these systems for something that matters, something messy, something real. And they break down in ways the benchmarks didn't capture.
This is the gap that AIRS-Bench addresses. The benchmark asks a simple but uncomfortable question: can AI agents actually do science? Not the flashy parts like discovering a formula or writing elegant code once. The grinding, iterative work. Running an experiment. Staring at confusing results. Adjusting your approach because something didn't work. Trying again. Doing this dozens of times until you have something worth publishing.
Most AI benchmarks measure whether agents can solve isolated puzzles. AIRS-Bench measures whether they can navigate the full lifecycle of research: idea formation, implementation, analysis, and iterative refinement. The findings reveal something important that contradicts both the hype and the cynicism. Agents exceed human performance on only four out of twenty tasks. But on all twenty, there's substantial room for improvement. This means the benchmark is measuring something real rather than something already solved.
Understanding the scientific work agents need to do
Real science doesn't look like a competition where you're given a well-defined problem and judged on your solution. Researchers understand an ambiguous goal, design an approach, implement it, run experiments, interpret unexpected results, refine their hypothesis, and iterate. Most of these steps involve failure. The failure teaches you what to try next.
Consider two different tasks. Task A: "Train a classifier on this dataset. What accuracy do you achieve?" There's one right answer. Task B: "A research group published a paper on text classification. Reproduce their methods, analyze what they found, and try to improve the results. No baseline code is provided." Task B is fundamentally harder because success is messier. You have to understand intent, not follow specifications. You need to know when something isn't working and why. You need to make good guesses about what to try next.
AIRS-Bench builds on this distinction. Instead of creating artificial puzzles, the researchers extracted twenty diverse research problems directly from published machine learning papers. These aren't toy problems. They're actual problems that appeared in real research, which means they contain the ambiguity and complexity of real scientific work.
The tasks span seven categories: code generation, mathematics, molecular and protein machine learning, question answering, text classification, text extraction and matching, and time series forecasting. This breadth matters because it tests whether agent capabilities transfer across different types of thinking. Can an agent that reasons well about text also reason about time series? Do code-writing intuitions transfer to pure mathematics?
Figure 3 shows the distribution of AIRS-Bench tasks across these seven categories, revealing the breadth of scientific domains the benchmark covers.
Distribution of AIRS-Bench tasks by category: Code, Math, Molecules & Proteins ML, Question Answering, Text Classification, Text Extraction & Matching, and Time Series.
The structure that enables comparison
Each AIRS-Bench task follows a simple template: {problem, dataset, metric}. The problem states the core computational challenge. The dataset is what you work with. The metric defines how success is measured. This structure is elegant because it's flexible enough to encompass wildly different research problems while remaining standardized enough to compare agents fairly.
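To make the template concrete, here's a minimal sketch of what a task specification could look like as a data structure. The class and field names are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class AIRSTask:
    """Hypothetical container mirroring the {problem, dataset, metric} template."""
    problem: str   # natural-language statement of the core computational challenge
    dataset: str   # identifier or path for the data the agent works with
    metric: str    # how submissions are scored, e.g. "macro_f1" or "rmse"

# Illustrative example only, not an actual benchmark task:
task = AIRSTask(
    problem="Reproduce and improve the published text-classification results on this corpus.",
    dataset="data/text_classification/",
    metric="macro_f1",
)
```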
Here's a crucial design choice: the benchmark deliberately doesn't provide baseline implementations. When a human researcher approaches a problem without reference code, they have to understand the problem from first principles, design a solution, implement it, and refine it based on what happens. AIRS-Bench puts agents through the same constraint. This is harder than "here's code, make it better," but it's more honest about what we're measuring.
Figure 1 shows an example task with its {problem, dataset, metric} structure, illustrating how AIRS-Bench specifies research challenges.
Example of an AIRS-Bench task specified by its problem statement, dataset, and evaluation metric.
An agent in this benchmark is defined as a pair: a language model plus a scaffold. The scaffold comprises the mechanisms—operators, search algorithms, reasoning strategies—that let the model explore the solution space effectively. Different agents use different scaffolds. Some rely on simple greedy approaches where the agent generates code and submits it. Others use more sophisticated search strategies that explore multiple paths and backtrack when something fails.
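As a rough illustration of the greedy end of that spectrum, the sketch below shows an agent loop that drafts a solution, runs it, and keeps the best-scoring attempt. The helpers `call_llm` and `run_and_score` are placeholders I'm assuming for the sake of the example, not the benchmark's actual API.

```python
# Minimal sketch of a "greedy" scaffold, under the assumption that the agent
# simply drafts code, runs it, and keeps the best-scoring attempt.
# call_llm and run_and_score are placeholders, not the benchmark's real interfaces.

def greedy_scaffold(task, call_llm, run_and_score, max_attempts=5):
    best_code, best_score = None, float("-inf")
    feedback = ""
    for _ in range(max_attempts):
        prompt = f"Problem: {task.problem}\nMetric: {task.metric}\n{feedback}"
        code = call_llm(prompt)                    # the LLM proposes a solution
        score, error = run_and_score(code, task)   # execute it and score the output
        if error:                                  # crashed or produced invalid output
            feedback = f"Previous attempt failed with: {error}"
            continue
        if score > best_score:
            best_code, best_score = code, score
        feedback = f"Previous attempt scored {score}; try to improve on it."
    return best_code, best_score
```

A more sophisticated scaffold would replace this single loop with tree search or backtracking over multiple candidate solutions.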
Figure 2 illustrates how agents combine an LLM with a scaffold to navigate scientific tasks.
An agent consists of a large language model paired with a scaffold—mechanisms that enable exploration of the solution space.
Measuring what actually matters
Standard metrics like "accuracy" or "test loss" work fine when you're measuring a single, well-defined output. But scientific work is messier. Agents might crash before submitting anything. They might produce output in the wrong format. They might produce valid output that's mediocre. We need metrics that capture these different dimensions of failure and success.
AIRS-Bench introduces three metrics that together paint a realistic picture of agent capability. The first is Valid Submission Rate (VSR). Can the agent produce output in the right format? This sounds trivial, but it's the first hurdle. If an agent can't complete the pipeline without crashing or producing malformed output, it can't do science, period. VSR reveals how often agents stumble at the basic mechanics of the task.
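In its simplest reading, VSR is just the fraction of attempts that yield a scoreable submission. A minimal sketch, with illustrative field names:

```python
# Minimal reading of Valid Submission Rate: the fraction of attempts that yield
# a correctly formatted, scoreable submission. Field names are illustrative.

def valid_submission_rate(attempts):
    """attempts: list of dicts, each carrying a boolean 'valid' flag."""
    if not attempts:
        return 0.0
    return sum(1 for a in attempts if a["valid"]) / len(attempts)

# Three valid submissions out of five attempts gives a VSR of 0.6.
print(valid_submission_rate([{"valid": v} for v in (True, True, False, True, False)]))
```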
Once you have a valid submission, the second metric asks: how good is it? The normalized score places each agent's performance relative to human state-of-the-art performance and a theoretical upper bound for that task. This lets you compare across tasks with completely different scales. An agent might score near the ceiling on one task but reach only 10 percent of optimal on another. The normalized score makes this comparison meaningful.
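One plausible way to compute such a normalization, shown purely as an illustration (the paper's exact formula may differ), anchors the scale so that human state-of-the-art maps to 0 and the theoretical upper bound maps to 1:

```python
# One plausible normalization, shown purely as an illustration: anchor the scale
# so human state-of-the-art maps to 0.0 and the theoretical upper bound maps to
# 1.0. The paper's exact formula may differ.

def normalized_score(raw, human_sota, upper_bound, lower_is_better=False):
    if lower_is_better:                       # e.g. RMSE, where smaller is better
        raw, human_sota, upper_bound = -raw, -human_sota, -upper_bound
    return (raw - human_sota) / (upper_bound - human_sota)

# An agent that exactly matches human SOTA scores 0.0, reaching the ceiling scores 1.0,
# and falling short of SOTA yields a negative value.
print(normalized_score(0.85, human_sota=0.80, upper_bound=1.00))  # 0.25
```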
The third metric is Elo rating, borrowed from chess. Each agent's performance on a task is compared against every other agent's performance on that same task. This creates a tournament-style ranking that accounts for the relative difficulty of beating other agents. When different agents excel at different tasks, Elo provides a principled way to rank them overall.
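The standard chess-style Elo update looks like the sketch below. Treating each pair of agents' scores on the same task as a "match" illustrates the idea; the benchmark's exact tournament setup may differ.

```python
# Sketch of a chess-style pairwise Elo update. Each pairwise comparison of two
# agents on the same task is treated as one "match"; this is an illustration,
# not necessarily the benchmark's exact procedure.

def elo_update(rating_a, rating_b, a_won, k=32):
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    actual_a = 1.0 if a_won else 0.0
    return rating_a + k * (actual_a - expected_a), rating_b - k * (actual_a - expected_a)

# Example: two agents rated 1500 meet; the winner gains 16 points, the loser drops 16.
print(elo_update(1500, 1500, a_won=True))  # (1516.0, 1484.0)
```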
These three metrics work together. VSR measures reliability. Normalized score measures quality. Elo provides a global ranking. Together, they show whether an agent is a reliable problem-solver, how close it gets to optimal, and how it stacks up against other approaches.
Figure 4 shows overall performance of all 14 evaluated agents across the three metrics, ordered by average normalized score.
Overall performance of the 14 evaluated agents on Valid Submission Rate, Average Normalized Score, and Elo Rating.
The humbling results
The researchers tested fourteen agents, from frontier models like GPT-4 and Claude to open-source alternatives, with various scaffolds and prompting strategies. The headline finding is stark: agents exceed human state-of-the-art performance on only four out of twenty tasks. On the remaining sixteen, they fall short.
But before interpreting this as failure, consider what it actually means. On the four tasks where agents beat human performance, they still don't reach the theoretical ceiling. There's room for improvement everywhere. This suggests AIRS-Bench is measuring something real and not yet saturated, which is far more useful than a benchmark where top agents have already maxed out.
The results reveal three layers of difficulty. The first is submission failure. Figure 5 shows that the overall valid submission rate is only 59.3 percent. This means agents fail to produce any valid output on roughly four out of every ten attempts. Why? They get confused about output format, run into execution errors, hit resource limits, or simply crash partway through the pipeline. This isn't about raw intelligence. It's about robustness when navigating complex workflows.
Figure 5 shows the distribution of submission rates across tasks for each agent, revealing high rates of invalid submissions.
Submission rate distribution for the 14 agents tested, showing the percentage of valid, invalid, and incomplete submissions.
When agents do produce valid outputs, the second layer becomes visible: most are mediocre. Figure 8 shows that agents cluster well below both human state-of-the-art and the theoretical ceiling. Agents aren't failing to execute pipelines. They're failing to solve problems well.
Figure 8 displays average normalized scores for all agents with 95 percent confidence intervals, showing widespread underperformance.
Average normalized scores for the 14 agents, indicating systematic underperformance relative to human benchmarks.
The third layer is task difficulty calibration. Figure 10 breaks results down by task difficulty: easy, medium, hard, and expert. Agents do better on easier tasks but still don't saturate them. On expert tasks, performance collapses. This pattern suggests the benchmark is properly calibrated. Easy tasks are actually easier, and hard tasks are actually hard.
Figure 10 shows normalized scores grouped by task difficulty level, revealing degraded performance as difficulty increases.
Average normalized scores per task difficulty level, showing significant performance degradation from easy to expert tasks.
Comparing different agents reveals that scaffold choice and prompting strategy matter, but not dramatically. Figure 11 shows Elo ratings for all agents tested. The greedy scaffold outperforms more sophisticated search strategies in some cases, but the differences aren't massive. This suggests the bottleneck isn't the reasoning algorithm itself. It's something deeper about how agents approach problems they haven't seen before.
Figure 11 displays Elo ratings for all 14 agents, comparing their relative performance across all tasks.
Elo ratings of all agents, showing comparative performance across the benchmark.
What this reveals about AI in science
AIRS-Bench functions as a diagnostic tool. It doesn't declare that agents can't do science. It reveals that agents can't yet do science reliably, consistently, and without human oversight. The benchmark identifies specific bottlenecks worth addressing.
Agents need better exploration strategies when facing genuinely novel problems. They need robustness: the ability to handle unexpected data formats, edge cases, and ambiguous task specifications. They need to understand when an approach isn't working and why, not just register that it failed. These are concrete research directions rather than vague critiques.
The practical implication is that AI agents can probably be useful collaborators on specific tasks within the research lifecycle, even if they can't run the full pipeline autonomously. They might excel at hypothesis testing, idea generation, or literature review, while humans remain essential for big-picture thinking and validation of results. This is less exciting than the vision of fully autonomous AI researchers, but it's more aligned with what the benchmark actually shows agents can do.
For benchmarking more broadly, AIRS-Bench suggests a template for evaluating systems on work that doesn't have clean, isolated solutions. Rather than creating artificial tasks that happen to be solvable, use actual research problems. This approach could become standard for assessing agent capabilities in other domains where real-world messiness is central to the work.
The bigger insight is that AI capability isn't binary. Agents don't either "do science" or "don't do science." They succeed on some tasks in some domains with some success rate. AIRS-Bench makes this granular picture visible, which is valuable because it points toward realistic deployment: where can agents reliably help? What still requires human judgment? Where does the division of labor between humans and machines become most productive?
The benchmark is open-sourced, so it will evolve. As new agents emerge and task difficulties shift with time, AIRS-Bench becomes a living measure of progress in autonomous science. The fact that it's far from saturated means researchers have a clear challenge: can you build an agent that performs better across these tasks? That competitive dimension, more than any individual result, is what makes this work useful.
The findings also connect to broader questions about how we evaluate AI for real-world deployment. Related work on benchmarking AI agents for scientific research, such as AstaBench's rigorous evaluation methodology and InnovatorBench's focus on innovative capability, explores similar themes: how do we measure whether agents can actually do valuable work rather than just solve puzzles?
AIRS-Bench matters because it refuses the easy answer. It doesn't measure something that's already solved. It doesn't measure something too narrow to matter. It measures real scientific work, in all its messy complexity, and shows us exactly where agents fall short and where they might improve. That clarity is far more useful than any headline claiming AI has solved research.
This is a Plain English Papers summary of a research paper called AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
