AI Can Now Do Expert-Level Work (Almost). 5 Surprising Findings from a Landmark 'GDPval' Study

Written by hacker-Antho | Published 2025/09/30
Tech Story Tags: ai | generative-ai | llms | open-ai | openai-benchmark | ai-testing-benchmark | gdpval-study | top-ai-in-the-world

TL;DR: OpenAI has released GDPval, a benchmark for evaluating AI on complex, real-world professional tasks. The results show that the best AI models are beginning to perform at a level comparable to highly experienced industry experts. There is no single “best” AI for every job; different models demonstrate distinct strengths.

Introduction: Moving Beyond the Hype to See What AI Can Really Do

The debate over AI’s impact on the job market is filled with speculation. But measuring its real-world effect through historical analogies, such as the adoption of electricity, gives us only lagging indicators of a shift that’s already underway. What we’ve needed is a leading indicator: a way to see what AI is capable of right now.

A groundbreaking new benchmark from OpenAI, called GDPval, provides exactly that. Unlike typical academic tests, GDPval evaluates AI models on complex, real-world tasks sourced directly from industry professionals with an average of 14 years of experience. The results provide one of the clearest pictures yet of what today’s most advanced AI can, and can’t, do in a professional setting. Here are the five most surprising takeaways.

Takeaway 1: On Complex Professional Tasks, AI Is Approaching Human-Expert Quality

The study’s most significant finding is that the best AI models are beginning to perform at a level comparable to highly experienced industry experts, and this capability is improving roughly linearly over time. The tasks evaluated were not simple queries; they were complex projects requiring an average of 7 hours for a human professional to complete.

Against this high bar, the results were striking. On the GDPval benchmark, deliverables from the top-performing model, Claude Opus 4.1, were judged to be better than or as good as the human expert’s work in 47.6% of cases. In other words, counting wins and ties together, the best model’s AI-generated deliverables matched or outperformed the human expert in nearly half of the tasks. This suggests that AI’s ability to handle long-horizon, subjective knowledge work is far more advanced than many have assumed.

Takeaway 2: The “Best” AI Depends on the Job: A Battle of Accuracy vs. Aesthetics

The study evaluated several frontier models — including GPT-5, GPT-4o, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4 — and revealed that there is no single “best” AI for every job. Instead, different models demonstrate distinct strengths, making tool selection a critical factor for professional use. The two top models highlighted this trade-off clearly:

  • Claude Opus 4.1 was the best-performing model overall, with a particular strength in aesthetics. It excelled at tasks involving visual presentation, performing better on file types like .pdf, .xlsx, and .ppt, where document formatting and professional slide layouts are key.
  • GPT-5 demonstrated a clear advantage in accuracy. It was superior at carefully following detailed instructions and performing correct calculations, making it a stronger choice for tasks requiring precision in pure text.

This distinction is crucial. It shows that effectively integrating AI into professional workflows isn’t just about using any AI, but about choosing the right tool for the specific demands of the task at hand.
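
To make this concrete, here is a hypothetical routing sketch in Python: pick a model family based on the deliverable a task produces. The task-to-model mapping and the model labels are illustrative assumptions drawn from the trade-off described above; they are not part of the GDPval benchmark or its tooling.

```python
# Hypothetical illustration of "choose the right tool for the task":
# route work by the type of deliverable it produces. The mapping below is an
# assumption based on the aesthetics-vs-accuracy trade-off, not GDPval data.

VISUAL_DELIVERABLES = {".pdf", ".xlsx", ".ppt"}  # formatting- and layout-heavy outputs

def pick_model(output_extension: str) -> str:
    """Return a model label for a task based on its primary output type."""
    if output_extension.lower() in VISUAL_DELIVERABLES:
        return "claude-opus-4.1"  # stronger on document and slide aesthetics
    return "gpt-5"                # stronger on instruction-following and calculations

print(pick_model(".ppt"))  # -> claude-opus-4.1
print(pick_model(".txt"))  # -> gpt-5
```

In practice the routing signal could just as easily be the task description or required output format, but the principle is the same: match a model’s demonstrated strengths to what the deliverable actually demands.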

Takeaway 3: AI’s Biggest Flaw Isn’t Hallucination, It’s Failing to Follow Simple Directions

While much of the public conversation around AI failures focuses on “hallucinations,” the study found a more mundane but critical issue. The single most common reason that experts rejected an AI’s work was its simple failure to fully follow instructions.

This was a primary weakness for models like Claude, Grok, and Gemini. In contrast, GPT-5 had the fewest instruction-following issues, but its deliverables were most often rejected due to formatting errors. This is a surprising and important takeaway, as it shifts the focus from failures in complex reasoning to more fundamental challenges in compliance and attention to detail. Crucially, this finding directly explains why the “AI co-pilot” model requires such careful human oversight, as we’ll see next.

Takeaway 4: The “AI Co-pilot” Is Real, But Savings Require a Human in the Loop

The study’s analysis of speed and cost savings confirms the value of the “AI co-pilot” model, but with a critical caveat: human oversight is non-negotiable. A “naive” comparison can be misleading; for instance, the data for GPT-5 showed it could generate an initial deliverable 90 times faster than a human expert.

However, when researchers modeled a more realistic workflow of “try the AI, review the output, and fix it yourself if it’s wrong,” the gains shrank dramatically. In this scenario, the net speed improvement from using GPT-5 was just 1.12 times. This data, based only on OpenAI’s models, illustrates that realizing time and cost benefits is entirely dependent on having a human expert in the loop to review, validate, and correct the AI’s work.

Interestingly, the researchers note this calculation likely underestimates the true savings, as it over-penalizes the AI by assuming the human has to start from scratch after every failed attempt. Still, it proves AI’s immediate economic value lies in augmenting experts, not replacing them.
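
One plausible way to formalize the “try the AI, review the output, fix it yourself” workflow is as an expected-time calculation, under the simplifying assumption that a rejected AI draft means the expert redoes the task from scratch. The sketch below is purely illustrative: the hours, minutes, and acceptance rate are made-up inputs, not GDPval’s measured figures.

```python
# Minimal sketch of the "try the AI, review, fix it yourself" time model.
# All inputs are illustrative assumptions, not numbers from the GDPval study.

def net_speedup(human_hours: float, ai_minutes: float,
                review_hours: float, win_rate: float) -> float:
    """Expected speedup when a human reviews every AI attempt and redoes
    the task from scratch whenever the AI's deliverable is not good enough."""
    ai_hours = ai_minutes / 60.0
    # Expected time with AI: generate + review, plus a full human redo on failure.
    expected_with_ai = ai_hours + review_hours + (1 - win_rate) * human_hours
    return human_hours / expected_with_ai

# Illustrative inputs: a 7-hour task, a fast model draft, a 1-hour expert
# review, and a ~40% chance the draft is accepted as-is.
print(round(net_speedup(human_hours=7, ai_minutes=5, review_hours=1, win_rate=0.4), 2))  # ~1.32
```

The last term, (1 - win_rate) * human_hours, is exactly the over-penalty the researchers flag: in reality an expert can often salvage parts of a rejected draft rather than starting over, so the true speedup sits somewhere above this kind of estimate.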

Takeaway 5: You Can Make AI Smarter Just by Asking It to Double-Check Its Work

One of the most practical findings was how easily AI performance can be improved through better prompting. Researchers gave GPT-5 a special prompt containing a detailed checklist, essentially asking it to double-check its own work for common errors. The results were significant:

  • It completely eliminated “black-square artifacts” that had previously appeared in over half of its generated PDFs.
  • It cut “egregious formatting errors” in PowerPoint files from 86% down to 64%.
  • Overall, it improved the model’s win rate against human experts by 5 percentage points.

The mechanism behind this improvement wasn’t magic, but engineering. The new prompt caused a sharp increase in how often the agent used its multimodal capabilities to visually inspect its own deliverables, jumping from 15% to 97%. This shows that users can dramatically improve AI quality by guiding the model to be more thorough and self-critical.
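
As a rough illustration of the idea, and not the study’s actual prompt or agentic setup, here is a minimal sketch of checklist-style self-review using the OpenAI Python SDK. The checklist wording, the “gpt-5” model string, and the example task are all assumptions made for the sketch.

```python
# Sketch of checklist-style self-review prompting (assumes the OpenAI Python SDK >= 1.0).
# The checklist text and model name are placeholders, not the GDPval study's prompt.
from openai import OpenAI

SELF_CHECK = """Before returning your deliverable, review it against this checklist:
1. Did you follow every instruction in the task, including the required file format?
2. Render or open what you produced and inspect it visually for artifacts
   (black squares, broken layouts, overflowing text).
3. Re-verify any calculations and cited figures.
Revise until every item passes, then return the final version."""

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-5",  # placeholder model name
    messages=[
        {"role": "system", "content": SELF_CHECK},
        {"role": "user", "content": "Build a 10-slide quarterly sales review deck outline."},
    ],
)
print(response.choices[0].message.content)
```

The design point is simply that the self-review instructions live in the system message, so every deliverable is generated with an explicit final inspection step rather than relying on the model to check itself unprompted.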

Conclusion: The Dawn of the AI-Augmented Professional

The GDPval benchmark provides clear evidence that AI is rapidly evolving into a capable tool for serious, complex knowledge work. However, its application is nuanced. This study focused on self-contained, precisely specified tasks, not the interactive, ambiguous challenges that define much of professional life. The findings show we are not on the verge of mass replacement, but rather entering an era of human-AI collaboration. The true potential is unlocked by professionals who know how to choose the right model, provide clear instructions, and maintain rigorous expert oversight.

These models are already this capable; what happens to the world of work when they get just a little bit better?


Written by hacker-Antho | Managing Director @ VML | Founder @ Fourth-Mind
Published by HackerNoon on 2025/09/30