Can Your AI Actually Use a Computer? A 2025 Map of Computer‑Use Benchmarks

Written by ashtonchew12 | Published 2025/12/10
Tech Story Tags: ai | reinforcement-learning | computer-use-agent | ai-agent | agi | ai-benchmarks | llm-evals | hackernoon-top-story

TL;DR: This article maps today's computer-use benchmarks across three layers (UI grounding, web agents, full OS use), shows how a few anchors like ScreenSpot, Mind2Web, REAL, OSWorld, and CUB are emerging, explains why scaffolding and harnesses often drive more gains than model size, and gives practical guidance on which evals to use if you are building GUI models, web agents, or full computer-use agents.

If you've been paying attention to "computer-use agents," you've probably noticed two facts:

1. Every new model is "SOTA" on something.

2. Almost none of those numbers line up.

OSWorld, CUB, Web Bench, Westworld, REAL, Mind2Web, ScreenSpot, GroundUI, Showdown-Clicks, WebClick… plus a dozen vendor-run leaderboards.

It feels more and more like the early days of web frameworks: too many options and not enough direction.

This post is an attempt to put the current ecosystem into one coherent picture: what's out there, how the benchmarks differ, and where this all is heading.

The three layers of "Computer-Use"

Almost every "computer-use" benchmark falls into one of three layers:

1. Low-level UI grounding – Localizing and identifying interface elements from screenshots

2. Web task execution – Multi-step task completion within browser environments

3. Full OS / multi-app computer use – Cross-application workflows on complete operating systems

Layer 1 – UI Grounding

These benchmarks take a screenshot and an instruction and ask the model to point at the right place (pixel, box, or UI element).

Core examples include the ScreenSpot family, which serves as the workhorse of GUI grounding. The original ScreenSpot covers web, mobile, and desktop UIs; ScreenSpot-v2 cleans up the labeling; ScreenSpot-Pro targets high-resolution professional apps across multiple industries and OSes.

GroundUI takes a different approach by mashing up ScreenSpot, Mind2Web, OmniACT and friends into an ~18k-example multi-platform dataset, plus a standard 1k-example eval subset.

Showdown-Clicks offers 5,679 human clicks from people doing tasks in a macOS desktop environment, used as a click-prediction benchmark.

Meanwhile, WebClick from H Company provides 1,600+ web screenshots with "click here" labels, used by Holo1/Holo1.5 to show off small-model UI localization.

If you're training the "eyes" of an agent (a vision-language model that can read screens and pick widgets), this is the layer you benchmark against. Almost every GUI agent paper now reports ScreenSpot / ScreenSpot-Pro / GroundUI / Showdown-Clicks numbers.
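
To make the task concrete, here is a minimal sketch of what a grounding eval loop tends to look like. The predict_click call and the dataset fields are assumptions for illustration, not any benchmark's real API; harnesses like ScreenSuite add prompt templates, per-platform splits, and batching on top.

# Minimal sketch of a ScreenSpot-style grounding eval (illustrative, not a real harness).
from dataclasses import dataclass

@dataclass
class GroundingExample:
    screenshot_path: str   # path to the UI screenshot
    instruction: str       # e.g. "click the Submit button"
    bbox: tuple            # ground-truth element box (x1, y1, x2, y2) in pixels

def predict_click(screenshot_path: str, instruction: str) -> tuple:
    """Placeholder for a VLM call that returns a predicted (x, y) point."""
    raise NotImplementedError

def point_in_box(point: tuple, bbox: tuple) -> bool:
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2

def grounding_accuracy(dataset: list) -> float:
    hits = sum(point_in_box(predict_click(ex.screenshot_path, ex.instruction), ex.bbox)
               for ex in dataset)
    return hits / len(dataset)

Most grounding benchmarks score roughly this way: a prediction typically counts as correct if the point (or the center of a predicted box) lands inside the labeled element.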

Layer 2 – Web-based agents

Here, the agent gets an actual browser (or a high-fidelity simulator) and has to complete tasks like "During the summer, book a hotel in New York City under $250" or "find the return policy for this product and make a return request for my most recent item.”

The Mind2Web family dominates this space. The offline dataset contains 2,350 tasks across 137 real websites and 31 domains, with action sequences. Online Mind2Web is the live equivalent: 300 tasks on 136 real websites, with a leaderboard that tracks accuracy, cost, and runs. Mind2Web 2 extends this with 130 long-horizon, research-style search tasks and adds "agent-as-a-judge" for answer correctness and attribution.

WebArena takes a different approach: it's a self-hosted web environment built from realistic mock sites (e-commerce, forums, GitLab-style repos, CMS, etc.) with hundreds of tasks that mimic everyday web tasks. REAL from AGI, Inc. offers 112 tasks across replicas of major sites like Amazon and DoorDash, with separate reward functions for "did you get the right info?" and "did you take the right actions?"

Web Bench & Westworld from Halluminate sit at opposite ends of the scale spectrum: Web Bench is 5,750 tasks across 452 real sites, while Westworld is a much smaller suite of realistic synthetic browser simulators with verifiable rewards.

Finally, WebVoyager defined tasks on 15 popular live websites, plus an automatic evaluation protocol using GPT-4V to judge open-ended behavior.
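
Underneath all of these sits roughly the same observe-act loop. Here is a minimal sketch using Playwright for the browser, with choose_action and check_success as placeholders for the model policy and the reward; the action format and function names are assumptions, not any benchmark's actual interface, and real harnesses add richer observations (DOM, accessibility tree, screenshots), action grammars, retries, and judges.

# Minimal sketch of a web-agent eval loop (illustrative; not any benchmark's harness).
from playwright.sync_api import sync_playwright

def choose_action(page_text: str, task: str) -> dict:
    """Placeholder for a model call returning e.g. {"type": "click", "selector": "..."}."""
    raise NotImplementedError

def check_success(page, task: str) -> bool:
    """Placeholder reward: inspect final page state, or hand the trace to a judge."""
    raise NotImplementedError

def run_task(start_url: str, task: str, max_steps: int = 15) -> bool:
    with sync_playwright() as p:
        page = p.chromium.launch(headless=True).new_page()
        page.goto(start_url)
        for _ in range(max_steps):
            action = choose_action(page.inner_text("body"), task)
            if action["type"] == "click":
                page.click(action["selector"])
            elif action["type"] == "type":
                page.fill(action["selector"], action["text"])
            elif action["type"] == "stop":
                break
        return check_success(page, task)

The benchmarks differ mainly in where the pages come from (live sites vs. replicas) and in how check_success is implemented: programmatic state checks, rubric judges, or a GPT-4V-style evaluator as in WebVoyager.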

Web-based agents have grown in popularity because they promise useful automation while keeping the action space smaller than the next layer, full OS computer use. Most web-only agents benchmark here first and then scale up to OS-level benchmarks.

Layer 3 – Full computer use

The final layer gives the agent a full OS: multiple apps, file system, copy-paste, etc. OSWorld serves as the anchor here, with 369 tasks on real Ubuntu / Windows / macOS machines spanning browsers, Office apps, file explorers, IDEs, email, media players, and more. Humans hit ~72% success; early best agents were around 12%. The OSWorld-Verified & OSWorld-Human extensions provide a cleaned-up harness plus human trajectories for all tasks, which let you measure not just if the agent succeeds but how many steps and how much time it burns compared to humans.
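
As a sketch of the kind of comparison OSWorld-Human enables, assume you have per-task step counts and wall-clock times for both the agent and the human trajectory; the record format below is illustrative, not the benchmark's actual schema.

# Illustrative efficiency report against human trajectories (field names are assumptions).
def efficiency_report(results: list) -> dict:
    """results: per-task dicts with success, agent_steps, human_steps, agent_seconds, human_seconds."""
    solved = [r for r in results if r["success"]]
    if not solved:
        return {"success_rate": 0.0}
    return {
        "success_rate": len(solved) / len(results),
        "mean_step_ratio": sum(r["agent_steps"] / r["human_steps"] for r in solved) / len(solved),
        "mean_time_ratio": sum(r["agent_seconds"] / r["human_seconds"] for r in solved) / len(solved),
    }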

CUB (Computer Use Benchmark) from Theta is a cross-vertical benchmark for long-horizon desktop + browser workflows. Leading AI agent companies like Manus AI display the CUB leaderboard scores alongside numbers from GAIA, a general AI agent benchmark with a few browser workflows.

SCUBA from Salesforce is a different animal: a Salesforce-internal benchmark built from ~300 real CRM workflows covering admin, sales, and service tasks, taking a deeply verticalized enterprise-SaaS view of computer use.

This final layer feels closest to an agent acting as a full knowledge worker. Accordingly, it is also the most difficult layer by far: agents often perform poorly here (typically low double-digit success rates) because of the varied apps and edge cases a full OS exposes.

Harness > model

Ben Anderson's post on computer-use evals makes a brutal but fair point: a lot of "SOTA" is actually prompt engineering plus scaffolding.

On the popular Showdown-Clicks benchmark, for example, the original paper reports ~20% accuracy for a big off-the-shelf model, while small finetuned models get ~70–80%.

Ben finds that Qwen's 72B model scores a mere ~20%. But then he swaps in a much simpler "click-only" XML prompt and sees his small 3B Qwen model jump to around 50% on the exact same benchmark. Here is the short prompt Ben used to get roughly 2.5× the score out of a much smaller model:

Determine where to click in the UI to complete the instruction/task.
Report the click location in XML like '<points x1="x" y1="y">click</points>.'
Image size [{w}, {h}], (0, 0) is the top-left and ({w}, {h}) is the bottom-right.
[IMAGE]
{instruction}
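
On the scoring side, all the harness has to do is pull the coordinates back out of that XML and run the usual point-in-box check. A sketch (not Ben's actual code):

# Sketch of parsing the click-only XML output above (illustrative, not the original harness).
import re

def parse_points(response: str):
    """Extract (x, y) from '<points x1="..." y1="...">click</points>', or None on a miss."""
    m = re.search(r'<points\s+x1="([\d.]+)"\s+y1="([\d.]+)"', response)
    return (float(m.group(1)), float(m.group(2))) if m else None

The striking part is that the jump came from the prompt and output format, not from scaling up the model.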

Similar stories show up elsewhere. REAL uses its own harness and reward functions for information and action tasks. ScreenSuite explicitly warns that its vision-only setup means Mind2Web-style scores aren't directly comparable to DOM-based agents.

For computer-use benchmarks today, a sizeable chunk of the performance gap you see on leaderboards is harness (prompts, tools, termination rules, retries, judges), not model weights. If you're comparing numbers across papers without looking at scaffolding, you're mostly reading marketing.

Convergence to a small set of "anchor" benchmarks

Despite the chaos, you can already see the field standardizing around a few anchors. For the grounding layer: ScreenSpot (including Pro), GroundUI, WebClick, and Showdown-Clicks. For the web layer: the trio of Mind2Web (offline + online + v2), plus WebArena and one of Web Bench / WebVoyager. For the OS layer: OSWorld (plus Verified and Human variants), CUB, and SCUBA. On top of that, ScreenSuite from Hugging Face acts as an umbrella harness that wraps many of these into one framework.

Any "computer-use agent" release is normally expected to report 1–2 grounding scores (ScreenSpot-v2/Pro, GroundUI, WebClick, Showdown-Clicks), 1–2 web scores (Online Mind2Web, Web Bench, REAL, Westworld), and 1–2 OS scores (OSWorld-Verified, CUB, SCUBA).

The shift from measurement to production

Early benchmarks just asked "success or failure." That's already starting to look quaint.

OSWorld-Human shows that even strong agents take 1.4–2.7× more steps than humans on these tasks; some trivial actions (like reformatting text) take agents minutes where a human needs seconds. Online Mind2Web tracks cost (API spend) and reliability across runs. REAL exposes multiple reward functions and emphasizes robustness across different scaffolds. The scoreboard is moving from single numbers (accuracy) to profiles (capability, reliability, cost, latency).
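
Concretely, a profile is just a handful of aggregates over repeated runs rather than one headline number. A sketch, with illustrative field names rather than any leaderboard's actual schema:

# Sketch of a capability / reliability / cost / latency profile (field names are assumptions).
from statistics import mean

def profile(runs: list) -> dict:
    """runs: one dict per (task, attempt) with task, success, usd_cost, seconds."""
    by_task = {}
    for r in runs:
        by_task.setdefault(r["task"], []).append(r["success"])
    return {
        "capability": mean(r["success"] for r in runs),           # overall success rate
        "reliability": mean(all(a) for a in by_task.values()),    # tasks solved on every attempt
        "cost_per_task": mean(r["usd_cost"] for r in runs),       # average API spend per attempt
        "latency": mean(r["seconds"] for r in runs),              # average wall-clock per attempt
    }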

This shift from research-grade thinking to production-level concerns may be an early sign that computer-use agents are maturing. Early production deployments are already being publicized: in a recent blog post, Amazon AGI's SF lab shared customer stories showing its computer-use agent Nova Act handling enterprise workflows such as complex form filling and long administrative processes.

Where do the named "brands" sit?

UI-TARS from ByteDance is a single screenshot-driven agent that reports numbers spanning all three layers, from ScreenSpot-Pro for grounding up to OSWorld for full computer use.

H Company specializes in grounding and shows results on ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown-Clicks, and its very own WebClick benchmark.

AGI, Inc. focuses on the web and OS layers via their own REAL and the established OSWorld leaderboards.

Theta concentrates on the OS and browser layer via CUB.

Benchmarks doubled as go-to-market channels

Many of these benchmarks also act as distribution and data engines. AGI, Inc. built REAL and then an SDK plus agents around it; being "#1 on REAL" is both a research claim and a funnel into their product. Theta's CUB is positioned as "Humanity's last exam for computer use agents.” Halluminate uses Westworld and Web Bench as both benchmarks and infrastructure for running browser agents at scale.

Benchmarks are becoming part measurement, part distribution, and part data flywheel. If you're picking which ones to invest in, you're also picking which ecosystems you want to plug into.

The shift from live sites to synthetic sandboxes

Many first-wave web benchmarks evaluated agents directly on live sites. Mind2Web and Online Mind2Web run tasks on real, changing webpages from over 100 popular sites. WebVoyager and Web Bench similarly use tasks on real websites like Amazon, Apple, Google Flights and hundreds of other high-traffic domains. This gives realism, but makes evaluation brittle: sites change, DOMs drift, and reliable automatic reward signals are hard to maintain at scale. In practice, large-scale parallel evaluation can run into rate limits or website terms-of-service constraints.

The emerging alternative is high-fidelity synthetic environments with built-in, programmatic rewards. WebArena provides a self-hosted "mini web" of fully functional sites (e-commerce, forums, project tools, CMS) whose state is fully observable and reproducible. Theta's CUB positions itself as "Humanity's Last Exam for Computer and Browser Use Agents," highlighting the complexity of tasks that can be built in these realistic environments. REAL (from AGI, Inc.) builds deterministic replicas of 11 widely used websites and evaluates agents via programmatic state checks plus rubric-based judging. Halluminate's Westworld offers a "fully simulated internet" of browser environments for economically meaningful workflows: its first benchmark, Web Bench, ran on live sites, and it moved to private synthetic sites for Westworld, its most recent benchmark. WARC-Bench goes further still, recording dynamic, realistic webpages into interactive Web ARChive files with programmatic reward functions.
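
The practical payoff is that the reward becomes a deterministic check on environment state rather than a judgment call about a live page. A minimal sketch of what such a programmatic reward might look like, against a hypothetical simulated booking environment (the env object and its fields are made up for illustration):

# Sketch of a programmatic reward for a synthetic sandbox task (env schema is hypothetical).
def reward_book_hotel(env) -> float:
    """Task: 'Book a hotel in New York City under $250.'"""
    booking = env.state.get("latest_booking")
    if booking is None:
        return 0.0
    checks = [
        booking["city"] == "New York City",
        booking["price_usd"] < 250,
        booking["status"] == "confirmed",
    ]
    return sum(checks) / len(checks)   # partial credit per satisfied condition

Because the sandbox owns the state, the check is exact and repeatable, which is what makes safe, massively parallel evaluation possible.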

Synthetic setups trade some realism for measurement quality. A simulated Amazon or flights site may miss rare edge cases you’d see on the real web, and there is an active interest in studying the “sim-to-real” gap, for example by comparing Westworld-style simulators with tasks on real Google Flights. But in return, these sandboxes offer stable tasks, precise ground truth, and safe, massively parallel evaluation.

Given this, the trajectory is clear: live-web benchmarks remain essential for checking real-world performance, but the center of gravity for day-to-day agent evaluation is moving toward realistic, instrumented sandboxes with explicit reward functions and full observability, especially as enterprise use cases shift toward private websites.

How to use this if you're building agents

If you're trying to ship an agent, here's a pragmatic checklist.

For all evaluations, avoid creating custom harnesses optimized for a single benchmark. To ensure meaningful results beyond launch announcements, use established public harnesses and document your implementation choices. Now onto the specific patterns per agent type:

If you're building a GUI-aware model

Your priorities should be to train on ScreenSpot + GroundUI + WebClick style data, then report on ScreenSpot-v2 / ScreenSpot-Pro / GroundUI-1K / WebClick / Showdown-Clicks, ideally via the ScreenSuite harness where applicable for standardization. You're optimizing for localization accuracy and robustness to varied UI skins.

If you're building a web agent

Start with Mind2Web (offline) to debug basic behavior. Move to Online Mind2Web + REAL for live behavior and cost curves. Consider Web Bench (real web, wide coverage) and WebArena / Westworld (self-hosted, simulated but realistic environments) once you care about distribution shift and robustness. Your north star becomes success rate, reliability, and cost per task.

If you're building a full “computer-use agent”

Use OSWorld-Verified as the standard ability check. Study OSWorld-Human to understand where you're much slower or more brittle than humans. If you're selling into enterprises, consider CUB and relevant vertical benchmarks like SCUBA.

The benchmarks are maturing faster than the agents, but they're still broken

A year ago, "computer-use" benchmarks were fragmented. Today we have a more complete benchmark stack. Grounding benchmarks that stress-test vision models on every UI imaginable. Web benchmarks spanning thousands of real sites. OS benchmarks that replicate actual knowledge work.

The best agents still struggle. Low success rates on OSWorld. Step counts 2x longer than humans. Costs that turn deployment into a CFO problem.

But there's a deeper issue. As Anderson showed, much of the performance gap on these benchmarks is scaffolding, not model quality. A 3B model with the right prompt can beat a 72B model with a naive one. The "everyone is SOTA on something" problem hasn't been solved. It's just moved from benchmark selection to harness engineering.

The chaos is starting to resolve around ScreenSpot/GroundUI for grounding, Mind2Web/REAL for web tasks, and OSWorld/CUB for full OS execution. But more importantly, people are catching on. When production deployments start, scaffolding tricks stop working. The benchmarks that survive will be the ones where performance actually predicts real-world behavior.

What matters now is rigor. Run the standard evals with public harnesses. The gap between benchmark performance and production reality is where all the actual work lives. The measurement infrastructure exists and will only get better. Scrutiny is coming and you should build for that world, not this one.


References


Cross-layer / general agent benchmarks mentioned

  • GAIA – Benchmark for General AI Assistants (450 real-world questions across three difficulty levels requiring tools, browsing, and multimodal reasoning): https://arxiv.org/abs/2311.12983

  • Ben Anderson – "Computer-Use Evals are a Mess": https://benanderson.work/blog/computer-use-benchmarks/

Disclaimer: I am currently working at Theta


Written by ashtonchew12 | Founding Engineer @ Theta (YC X25), working on RL and RL infrastructure