If you've been watching "computer-use agents", you've probably noticed two facts:

1. Every new model is "SOTA" on something.
2. Almost none of those numbers line up.

OSWorld, CUB, Web Bench, Westworld, REAL, Mind2Web, ScreenSpot, GroundUI, Showdown-Clicks, WebClick… plus a dozen vendor-run leaderboards. It feels more and more like early web frameworks: too many options and not enough direction.

This post is an attempt to put the current ecosystem into one coherent picture: what's out there, how the benchmarks differ, and where it's all heading.

The three layers of "computer use"

Almost every "computer-use" benchmark falls into one of three layers:

1. Low-level UI grounding – localizing and identifying interface elements from screenshots.
2. Web task execution – multi-step task completion within browser environments.
3. Full OS / multi-app computer use – cross-application workflows on complete operating systems.

Layer 1 – UI grounding

These benchmarks take a screenshot and an instruction and ask the model to point at the right place (a pixel, a box, or a UI element).

The ScreenSpot family is the workhorse of GUI grounding. The original ScreenSpot covers web, mobile, and desktop UIs; ScreenSpot-v2 cleans up the labeling; ScreenSpot-Pro targets high-resolution professional apps across multiple industries and OSes.

GroundUI takes a different approach, mashing up ScreenSpot, Mind2Web, OmniACT and friends into an ~18k-example multi-platform dataset, plus a standard 1k-example eval subset.

Showdown-Clicks offers 5,679 human clicks from people doing tasks in a macOS desktop environment, used as a click-prediction benchmark. Meanwhile, WebClick from H Company provides 1,600+ web screenshots with "click here" labels, used by Holo1/Holo1.5 to show off small-model UI localization.

If you're training the "eyes" of an agent (a vision-language model that can read screens and pick widgets), this is your layer. Almost every GUI agent paper now reports ScreenSpot / ScreenSpot-Pro / GroundUI / Showdown-Clicks numbers.

Layer 2 – Web-based agents

Here, the agent gets an actual browser (or a high-fidelity simulator) and has to complete tasks like "During the summer, book a hotel in New York City under $250" or "Find the return policy for this product and make a return request for my most recent item."

The Mind2Web family dominates this space. The offline dataset contains 2,350 tasks across 137 real websites and 31 domains, with action sequences. Online Mind2Web is the live equivalent: 300 tasks on 136 real websites, with a leaderboard that tracks accuracy, cost, and runs. Mind2Web 2 extends this with 130 long-horizon, research-style search tasks and adds "agent-as-a-judge" for answer correctness and attribution.

WebArena takes a different approach: it's a self-hosted web environment built from realistic mock sites (e-commerce, forums, GitLab-style repos, CMS, etc.) with hundreds of tasks that mimic everyday web work. REAL from AGI, Inc. offers 112 tasks across replicas of major sites like Amazon and DoorDash, with separate reward functions for "did you get the right info?" and "did you take the right actions?"
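To make that information/action split concrete, here is a minimal, hypothetical sketch of what two separate reward functions could look like. The function names, the EpisodeResult shape, and the matching logic are all illustrative assumptions, not REAL's actual API.

```python
# Hypothetical illustration of separate "info" vs. "action" rewards (not REAL's actual API).
from dataclasses import dataclass

@dataclass
class EpisodeResult:
    answer: str        # free-text answer the agent returned, if any
    final_state: dict  # environment state at the end of the episode (e.g., order records)

def info_reward(result: EpisodeResult, expected_answer: str) -> float:
    """Did the agent extract the right information? Check the returned answer string."""
    return 1.0 if expected_answer.lower() in result.answer.lower() else 0.0

def action_reward(result: EpisodeResult, expected_state: dict) -> float:
    """Did the agent take the right actions? Check the environment's final state."""
    matched = sum(1 for k, v in expected_state.items() if result.final_state.get(k) == v)
    return matched / max(len(expected_state), 1)

# A single task can be scored on both axes independently.
result = EpisodeResult(answer="The total came to $42.10",
                       final_state={"order_placed": True, "items": 2})
print(info_reward(result, "$42.10"))                              # 1.0
print(action_reward(result, {"order_placed": True, "items": 2}))  # 1.0
```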
Web Bench & Westworld from Halluminate focus on scale: Web Bench is 5,750 tasks across 452 real sites, while Westworld is a much smaller suite of realistic synthetic browser simulators with verifiable rewards.

Finally, WebVoyager defined tasks on 15 popular live websites, plus an automatic evaluation protocol that uses GPT-4V to judge open-ended behavior.

Web-based agents have grown popular because the browser's action space is smaller than that of the next layer, full OS computer use, which makes automation more tractable. Most web-only agents benchmark here and then scale up to OS-level benchmarks.

Layer 3 – Full computer use

The final layer gives the agent a full OS: multiple apps, a file system, copy-paste, and so on.

OSWorld is the anchor here, with 369 tasks on real Ubuntu / Windows / macOS machines spanning browsers, Office apps, file explorers, IDEs, email, media players, and more. Humans hit ~72% success; early best agents were around 12%. The OSWorld-Verified and OSWorld-Human extensions provide a cleaned-up harness plus human trajectories for all tasks, which lets you measure not just whether the agent succeeds but how many steps and how much time it burns compared to a human.

CUB (Computer Use Benchmark) from Theta is a cross-vertical benchmark for long-horizon desktop + browser workflows. Leading AI agent companies like Manus AI report CUB leaderboard scores alongside numbers from GAIA, a general AI agent benchmark with a few browser workflows.

SCUBA from Salesforce takes a deeply verticalized, enterprise-SaaS view: it's a benchmark built from ~300 real Salesforce CRM workflows covering admin, sales, and service tasks.

This final layer is the closest to an agent acting as a full knowledge worker. It is also the most difficult layer by far: agents often post low double-digit success rates because of the varied environments and edge cases a full OS exposes.

Harness > model

Ben Anderson's post on computer-use evals makes a brutal but fair point: a lot of "SOTA" is actually prompt engineering plus scaffolding.

On Showdown-Clicks, for example, the original paper reports ~20% accuracy for a big off-the-shelf model, while small finetuned models get ~70–80%. Ben finds that Qwen's 72B model scores a mere ~20%. But when he swaps in a much simpler "click-only" XML prompt, his small 3B Qwen model jumps to around 50% on the exact same benchmark: roughly 2.5× the score from a model a fraction of the size. Here is the short prompt Ben used:

```
Determine where to click in the UI to complete the instruction/task. Report the click location in XML like '<points x1="x" y1="y">click</points>.'
Image size [{w}, {h}], (0, 0) is the top-left and ({w}, {h}) is the bottom-right.
[IMAGE]
{instruction}
```
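To show how little scaffolding that takes, here is a minimal sketch of wiring the template up and parsing the model's reply. Only the prompt text comes from Ben's post; the helper names, the regex, and the hit test are my own illustrative assumptions, and the actual VLM call is left out entirely.

```python
import re

# Ben's click-only prompt template; {w}/{h} are the screenshot dimensions.
PROMPT = (
    "Determine where to click in the UI to complete the instruction/task. "
    "Report the click location in XML like '<points x1=\"x\" y1=\"y\">click</points>.' "
    "Image size [{w}, {h}], (0, 0) is the top-left and ({w}, {h}) is the bottom-right."
)

def build_prompt(w: int, h: int, instruction: str) -> str:
    # "[IMAGE]" stands in for however your VLM client attaches the screenshot.
    return PROMPT.format(w=w, h=h) + "\n[IMAGE]\n" + instruction

def parse_click(reply: str) -> tuple[int, int] | None:
    """Extract (x, y) from a '<points x1="..." y1="...">click</points>' tag, if present."""
    m = re.search(r'<points\s+x1="(\d+(?:\.\d+)?)"\s+y1="(\d+(?:\.\d+)?)"', reply)
    return (round(float(m.group(1))), round(float(m.group(2)))) if m else None

def is_hit(point: tuple[int, int], bbox: tuple[int, int, int, int]) -> bool:
    """Click benchmarks then score whether the point lands inside the target element's box."""
    x, y = point
    x0, y0, x1, y1 = bbox
    return x0 <= x <= x1 and y0 <= y <= y1

print(parse_click('<points x1="512" y1="381">click</points>'))  # (512, 381)
print(is_hit((512, 381), (490, 360, 540, 400)))                 # True
```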
Similar stories show up elsewhere. REAL uses its own harness and reward functions for information and action tasks. ScreenSuite explicitly warns that its vision-only setup means Mind2Web-style scores aren't directly comparable to DOM-based agents.

For computer-use benchmarks today, a sizeable chunk of the performance gap you see on leaderboards is harness (prompts, tools, termination rules, retries, judges), not model weights. If you're comparing numbers across papers without looking at scaffolding, you're mostly reading marketing.

Convergence to a small set of "anchor" benchmarks

Despite the chaos, you can already see the field standardizing around a few anchors. For the grounding layer: ScreenSpot (including Pro), GroundUI, WebClick, and Showdown-Clicks. For the web layer: the Mind2Web trio (offline + online + v2), plus WebArena and one of Web Bench / WebVoyager. For the OS layer: OSWorld (plus its Verified and Human variants), CUB, and SCUBA. On top of that, ScreenSuite from Hugging Face acts as an umbrella harness that wraps many of these into one framework.

Any "computer-use agent" release is now normally expected to report 1–2 grounding scores (ScreenSpot-v2/Pro, GroundUI, WebClick, Showdown-Clicks), 1–2 web scores (Online Mind2Web, Web Bench, REAL, Westworld), and 1–2 OS scores (OSWorld-Verified, CUB, SCUBA).

The shift from measurement to production

Early benchmarks just asked "success or failure." That's already starting to look quaint. OSWorld-Human shows that even strong agents take 1.4–2.7× more steps than humans on these tasks; some trivial actions (like reformatting text) take agents minutes where a human needs seconds. Online Mind2Web tracks cost (API spend) and reliability across runs. REAL exposes multiple reward functions and emphasizes robustness across different scaffolds. The scoreboard is moving from single numbers ("accuracy") to profiles ("capability", "reliability", "cost", "latency").

This shift from research-grade thinking to production-level thinking may be an early sign that computer-use agents are maturing. Early production deployments are already being publicized: in a recent blog post, Amazon AGI's SF lab shared customer stories of its agent Nova Act handling enterprise workflows such as complex form filling and long administrative processes.

Where do the named "brands" sit?

UI-TARS from ByteDance is a single screenshot-driven agent that reports numbers on benchmarks from ScreenSpot-Pro to OSWorld, spanning all three layers.

H Company specializes in grounding and shows results on ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown-Clicks, and its own WebClick benchmark.

AGI, Inc. focuses on the web and OS layers via its own REAL leaderboard and the established OSWorld leaderboard.

Theta concentrates on the OS and browser layers via CUB.
Benchmarks double as go-to-market channels

Many of these benchmarks also act as distribution and data engines. AGI, Inc. built REAL and then an SDK plus agents around it; being "#1 on REAL" is both a research claim and a funnel into their product. Theta's CUB is positioned as "Humanity's Last Exam for Computer and Browser Use Agents." Halluminate uses Westworld and Web Bench as both benchmarks and infrastructure for running browser agents at scale.

Benchmarks are becoming part measurement, part distribution, and part data flywheel. If you're picking which ones to invest in, you're also picking which ecosystems you want to plug into.

The shift from live sites to synthetic sandboxes

Many first-wave web benchmarks evaluated agents directly on live sites. Mind2Web and Online Mind2Web run tasks on real, changing webpages from over 100 popular sites. WebVoyager and Web Bench similarly use tasks on real websites like Amazon, Apple, Google Flights, and hundreds of other high-traffic domains. This gives realism, but it makes evaluation brittle: sites change, DOMs drift, and reliable automatic reward signals are hard to maintain at scale. In practice, large-scale parallel evaluation can also run into rate limits or website terms-of-service constraints.

The emerging alternative is high-fidelity synthetic environments with built-in, programmatic rewards. WebArena provides a self-hosted "mini web" of fully functional sites (e-commerce, forums, project tools, CMS) whose state is fully observable and reproducible. Theta's CUB leans on the same idea, highlighting the complexity of tasks that can be built in these realistic environments. REAL (from AGI, Inc.) builds deterministic replicas of 11 widely used websites and evaluates agents via programmatic state checks plus rubric-based judging. Halluminate's Westworld offers a "fully simulated internet" of browser environments for economically meaningful workflows; notably, Halluminate started with Web Bench on live sites and moved to private synthetic sites for Westworld, its most recent benchmark. WARC-Bench goes further by recording dynamic, realistic webpages into interactive Web ARChive files with programmatic reward functions.

Synthetic setups trade some realism for measurement quality. A simulated Amazon or flights site may miss rare edge cases you'd see on the real web, and there is active interest in studying the "sim-to-real" gap, for example by comparing Westworld-style simulators with tasks on real Google Flights. But in return, these sandboxes offer stable tasks, precise ground truth, and safe, massively parallel evaluation.

Given this, the trajectory is clear: live-web benchmarks remain essential for checking real-world performance, but the center of gravity for day-to-day agent evaluation is moving toward realistic, instrumented sandboxes with explicit reward functions and full observability, especially as enterprise use cases shift toward private websites.

How to use this if you're building agents

If you're trying to ship an agent, here's a pragmatic checklist.
For all evaluations, avoid building a custom harness optimized for a single benchmark. If you want results that mean something beyond a launch announcement, use established public harnesses and document your implementation choices. Now onto the specific patterns per agent type.

If you're building a GUI-aware model

Train on ScreenSpot + GroundUI + WebClick style data, then report on ScreenSpot-v2 / ScreenSpot-Pro / GroundUI-1K / WebClick / Showdown-Clicks, ideally via the ScreenSuite harness where applicable for standardization. You're optimizing for localization accuracy and robustness to varied UI skins.

If you're building a web agent

Start with Mind2Web (offline) to debug basic behavior. Move to Online Mind2Web + REAL for live behavior and cost curves. Consider Web Bench (real web, wide coverage) and WebArena / Westworld (self-hosted, simulated but realistic environments) once you care about distribution shift and robustness. Your north star becomes success rate and reliability and cost per task.

If you're building a full "computer-use agent"

Use OSWorld-Verified as the standard ability check. Study OSWorld-Human to understand where you're much slower or more brittle than humans. If you're selling into enterprises, consider CUB and relevant vertical benchmarks like SCUBA.

The benchmarks are maturing faster than the agents, but they're still broken

A year ago, "computer-use" benchmarks were fragmented. Today we have a more complete benchmark stack: grounding benchmarks that stress-test vision models on every UI imaginable, web benchmarks spanning thousands of real sites, and OS benchmarks that replicate actual knowledge work.

The best agents still struggle. Low success rates on OSWorld. Step counts 2× those of humans. Costs that turn deployment into a CFO problem.

But there's a deeper issue. As Anderson showed, much of the performance gap on these benchmarks is scaffolding, not model quality. A 3B model with the right prompt can beat a 72B model with a naive one. The "everyone is SOTA on something" problem hasn't been solved; it has just moved from benchmark selection to harness engineering.

The chaos is starting to resolve around ScreenSpot/GroundUI for grounding, Mind2Web/REAL for web tasks, and OSWorld/CUB for full OS execution. More importantly, people are catching on. When production deployments start, scaffolding tricks stop working. The benchmarks that survive will be the ones where performance actually predicts real-world behavior.

What matters now is rigor. Run the standard evals with public harnesses. The gap between benchmark performance and production reality is where all the actual work lives. The measurement infrastructure exists and will only get better. Scrutiny is coming, and you should build for that world, not this one.
References

Layer 1 – UI grounding

ScreenSpot – Original multi-platform GUI grounding benchmark (mobile, desktop, web). https://llm-stats.com/benchmarks/screenspot
ScreenSpot-v2 – Updated GUI grounding benchmark with cleaner labels and broader coverage. https://huggingface.co/datasets/Voxel51/ScreenSpot-v2
ScreenSpot-Pro – High-resolution professional GUI grounding benchmark (23 apps, 5 industries, 3 OSes). https://arxiv.org/abs/2504.07981
GroundUI / GroundUI-1K – Multi-platform (web / desktop / mobile) grounding dataset with a 1K eval subset. https://huggingface.co/datasets/agent-studio/GroundUI-1K
Showdown-Clicks – 5,679 human clicks from macOS desktop tasks for click prediction and low-level control. https://huggingface.co/datasets/generalagents/showdown-clicks
WebClick – 1,600+ web screenshots with "click here" labels; H Company's benchmark for web localizers. https://huggingface.co/datasets/Hcompany/WebClick
ScreenSuite – Hugging Face's umbrella GUI-agent benchmarking harness covering perception plus single- and multi-step tasks. https://github.com/huggingface/screensuite

Layer 2 – Web-based agents

Mind2Web (offline) – 2,350 tasks across 137 real websites and 31 domains with action sequences. https://osu-nlp-group.github.io/Mind2Web/
Online Mind2Web – 300 tasks on 136 live websites; public leaderboard for web agents on real sites. https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard
Mind2Web 2 – 130 long-horizon, real-time browsing tasks with an Agent-as-a-Judge framework. https://osu-nlp-group.github.io/Mind2Web-2/
WebArena – Self-hosted "mini-web" of realistic mock sites with a benchmark for functional task completion. https://webarena.dev/
REAL Bench (REAL) – AGI, Inc.'s "mini-Internet" of replicated major sites with programmatic rewards and rubric-based judging.
  Blog post: https://www.theagi.company/blog/introducing-real-bench
  Leaderboard / evals: https://www.realevals.xyz
Web Bench – 5,750 tasks across 452 high-traffic live sites; Halluminate's large-scale browser-agent benchmark. https://github.com/Halluminate/WebBench
Westworld – Suite of highly realistic browser simulators with verifiable rewards for web-agent benchmarking. https://halluminate.ai/blog/westworld
WebVoyager – Benchmark of tasks on dynamic live websites for end-to-end web navigation agents. https://arxiv.org/abs/2401.13919
WARC-Bench – Web-archive-based benchmark of 438 GUI subtasks on dynamic, realistic archived webpages (via Web ARChive files). https://arxiv.org/abs/2510.09872

Layer 3 – Full computer / multi-app use

OSWorld – 369 multimodal computer-use tasks on real Ubuntu / Windows / macOS apps and file I/O. https://os-world.github.io
OSWorld-Human / OSWorld-Verified – Efficiency-focused extensions with human trajectories and cleaned harnesses. OSWorld-Human: https://mlsys.wuklab.io/posts/oshuman/
CUB (Computer Use Benchmark) – Theta's cross-vertical benchmark for long-horizon desktop + browser workflows ("Humanity's Last Exam for Computer and Browser Use Agents").
  Blog post: https://thetasoftware.com/blog/introducing-cub/
  Announcement: https://x.com/trytheta/status/1923169553497866568
SCUBA (Salesforce Computer Use Benchmark) – ~300 Salesforce CRM workflows across admin / sales / service personas in sandbox environments. https://sfrcua.github.io/SCUBA/

Cross-layer / general agent benchmarks mentioned

GAIA – Benchmark for General AI Assistants (450 real-world questions across three difficulty levels requiring tools, browsing, and multimodal reasoning). https://arxiv.org/abs/2311.12983
Ben Anderson, "Computer-Use Evals are a Mess". https://benanderson.work/blog/computer-use-benchmarks/

Disclaimer: I am currently working at Theta.