If you've seen "computer-use agents", you've noticed two facts:
1. Every new model is "SOTA" at something.
2. Almost none of those numbers line up.

OSWorld, CUB, Web Bench, Westworld, REAL, Mind2Web, ScreenSpot, GroundUI, Showdown-Clicks, WebClick ... and a dozen leaderboards. It feels more and more like early web frameworks: too many options and not enough direction. This post is an attempt to put the current ecosystem into one coherent picture: what's out there, how the benchmarks differ, and where all of this is heading.

The three layers of "Computer-Use"

Almost every "computer-use" benchmark falls into one of three layers:
1. Low-level UI grounding – localizing and identifying interface elements from screenshots
2. Web task execution – multi-step task completion within browser environments
3. Full OS / multi-app computer use – cross-application workflows on complete operating systems

Layer 1 – UI Grounding

These benchmarks take a screenshot and an instruction and ask the model to point at the right place (a pixel, a box, or a UI element).

Core examples include the ScreenSpot family, which serves as the workhorse of GUI grounding. The original ScreenSpot covers web, mobile, and desktop UIs; ScreenSpot-v2 cleans up the labeling; ScreenSpot-Pro adds high-resolution professional applications across multiple industries and operating systems. GroundUI takes a different approach, mashing ScreenSpot, Mind2Web, OmniACT, and friends into a ~18k-example multi-platform dataset, plus a standard 1k-example eval subset. Showdown-Clicks offers 5,679 human clicks from people doing tasks in a macOS desktop environment, used as a click-prediction benchmark. Meanwhile, WebClick from H Company provides 1,600+ web screenshots with "click here" labels, used by Holo1/Holo1.5 to showcase small-model UI localization.

If you're training the "eyes" of an agent (a Vision-Language Model that can read screens and pick widgets), this is your layer. Almost every GUI agent paper now reports ScreenSpot / ScreenSpot-Pro / GroundUI / Showdown-Clicks numbers.
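To make concrete what this layer scores, here is a minimal sketch of the kind of metric grounding benchmarks report: a predicted click counts as a hit if it lands inside the target element's bounding box. The data shapes and function names are illustrative assumptions, not any benchmark's actual harness, and exact scoring rules vary per benchmark.

```python
# Sketch of a grounding metric: a prediction is correct if the predicted point
# falls inside the ground-truth element's bounding box. Illustrative only.
from dataclasses import dataclass


@dataclass
class GroundingExample:
    instruction: str
    bbox: tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def point_in_bbox(x: float, y: float, bbox: tuple[float, float, float, float]) -> bool:
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(examples: list[GroundingExample],
                       predictions: list[tuple[float, float]]) -> float:
    """Fraction of examples where the predicted click lands inside the target element."""
    if not examples:
        return 0.0
    hits = sum(point_in_bbox(px, py, ex.bbox)
               for ex, (px, py) in zip(examples, predictions))
    return hits / len(examples)
```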
Layer 2 – Web-based agents

Here, the agent gets an actual browser (or a high-fidelity simulator) and has to complete tasks like "During the summer, book a hotel in New York City under $250" or "find the return policy for this product and make a return request for my most recent item."

The Mind2Web family dominates this space. The offline dataset contains 2,350 tasks across 137 real websites and 31 domains, with action sequences. Online Mind2Web is the live equivalent: 300 tasks on 136 real websites, with a leaderboard that tracks accuracy, cost, and runs. Mind2Web 2 extends this with 130 long-horizon, research-style search tasks and adds "agent-as-a-judge" scoring for answer correctness and attribution.

WebArena takes a different approach: it is a self-hosted web environment built from realistic mock sites (e-commerce, forums, a GitLab-style repo, a CMS, etc.) with hundreds of tasks that mimic everyday web work. REAL from AGI, Inc. offers 112 tasks across replicas of major sites like Amazon and DoorDash, with separate reward functions for "did you get the right info?" and "did you take the right actions?" Web Bench and Westworld from Halluminate focus on scale: Web Bench has 5,750 tasks across 452 real sites, while Westworld is a much smaller suite of realistic synthetic browser simulators with verifiable rewards. Finally, WebVoyager defined tasks on 15 popular live websites, plus an automatic evaluation protocol that uses GPT-4V to judge open-ended behavior.

Web-based agents have grown in popularity because of their promise for task automation: the action space is smaller than in the next layer, full-OS computer use.

Layer 3 – Full computer use

The final layer gives the agent a complete operating system: multiple apps, a file system, copy-paste, and so on. OSWorld serves as the anchor here, with 369 tasks on real Ubuntu / Windows / macOS machines spanning browsers, Office apps, file explorers, IDEs, email, media players, and more. Humans hit ~72% success; early best agents were around 12%. The OSWorld-Verified and OSWorld-Human extensions provide a clean harness and human trajectories for every task, letting you measure not just whether the agent succeeds, but how many steps and how much time it burns compared to humans.

CUB (Computer Use Benchmark) from Theta is a cross-vertical benchmark for long-horizon desktop + browser workflows. Leading AI agent companies like Manus AI display CUB leaderboard scores alongside numbers from GAIA, a general AI agent benchmark with a few browser workflows. SCUBA from Salesforce takes a different approach: it is a Salesforce-internal benchmark built from ~300 real CRM workflows covering admin, sales, and service tasks, a deeply verticalized enterprise-SaaS view of computer use.

This final layer feels closest to an agent acting as a full knowledge worker. Accordingly, it is also the most difficult layer by far: agents often perform poorly on these benchmarks (often low double-digit success rates) because of the varied environments and edge cases of a full operating system.

Harness > Model

Ben Anderson's post on computer-use evals makes a brutal but fair point: a lot of "SOTA" is actually prompt engineering plus scaffolding. On the popular Showdown-Clicks benchmark, for example, the original writeups report ~20% accuracy for a big off-the-shelf model while small finetuned models get ~70–80%. Ben finds that Qwen's 72B model scores a mere ~20%. But then he swaps in a much simpler "click-only" XML prompt and sees his small Qwen 3B model jump to around 50% on the exact same benchmark. Here is the short prompt Ben used for the roughly 2.5× jump in score despite the much smaller model:

Determine where to click in the UI to complete the instruction/task. Report the click location in XML like '<points x1="x" y1="y">click</points>.' Image size [{w}, {h}], (0, 0) is the top-left and ({w}, {h}) is the bottom-right. [IMAGE] {instruction}

Similar stories show up elsewhere. REAL uses its own harness and reward functions for information and action tasks. ScreenSuite explicitly warns that its vision-only setup means its Mind2Web-style scores aren't directly comparable to DOM-based agents.

For computer-use benchmarks today, a sizeable chunk of the performance gap you see on leaderboards is harness (prompts, tools, termination rules, retries, judges), not model weights. If you're comparing numbers across papers without looking at scaffolding, you're mostly reading marketing.
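To make the harness point concrete, here is a minimal sketch of what a click-only harness around that prompt could look like: fill the template with the image size and instruction, then parse the `<points ...>` XML the model is asked to emit. The function names and parsing choices are my own illustration under those assumptions, not Ben's actual code.

```python
import re
from typing import Optional

# The click-only prompt quoted above, as a format template.
PROMPT_TEMPLATE = (
    "Determine where to click in the UI to complete the instruction/task. "
    "Report the click location in XML like '<points x1=\"x\" y1=\"y\">click</points>.' "
    "Image size [{w}, {h}], (0, 0) is the top-left and ({w}, {h}) is the bottom-right. "
    "[IMAGE] {instruction}"
)


def build_prompt(width: int, height: int, instruction: str) -> str:
    # Fill in the image dimensions and task text exactly as the template expects.
    return PROMPT_TEMPLATE.format(w=width, h=height, instruction=instruction)


POINTS_RE = re.compile(r'<points\s+x1="(?P<x>-?\d+(?:\.\d+)?)"\s+y1="(?P<y>-?\d+(?:\.\d+)?)"')


def parse_click(model_output: str) -> Optional[tuple[float, float]]:
    """Extract the (x, y) click the model reported, or None if the XML is malformed."""
    m = POINTS_RE.search(model_output)
    if m is None:
        return None  # a harness decision: malformed output counts as a miss
    return float(m.group("x")), float(m.group("y"))
```

Small harness decisions like these (what counts as malformed output, whether to retry, how strictly to parse) are exactly the scaffolding differences that can move leaderboard scores by tens of points.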
Convergence to a small set of "anchor" benchmarks

Despite the chaos, you can already see the field standardizing around a few anchors. For the grounding layer: ScreenSpot (and its Pro variant), GroundUI, WebClick, and Showdown-Clicks. For the web layer: the Mind2Web trio (offline + online + v2), plus WebArena and one of Web Bench / WebVoyager. For the OS layer: OSWorld (plus its Verified and Human variants), CUB, and SCUBA. On top of that, Hugging Face's ScreenSuite acts as an umbrella harness that wraps many of these into one framework.

Any "computer-use agent" release is now normally expected to report 1–2 grounding scores (ScreenSpot-v2/Pro, GroundUI, WebClick, Showdown-Clicks), 1–2 web scores (Online Mind2Web, Web Bench, REAL, Westworld), and 1–2 OS scores (OSWorld-Verified, CUB, SCUBA).

The shift from measurement to production

Early benchmarks simply asked "success or failure". OSWorld-Human shows that even strong agents take 1.4–2.7× more steps than humans on these tasks; some trivial actions (like reformatting text) take agents minutes where a human needs seconds. Online Mind2Web tracks API costs and operational reliability. REAL exposes multiple reward functions and emphasizes robustness across different scaffolds. The scoreboard is moving from single numbers ("accuracy") to profiles ("capability", "reliability", "cost", "latency").

This shift from research-grade thinking to production-grade thinking may be an early indicator that the "computer-use agent" is progressing healthily. In fact, early production deployments already exist: in a recent blog post, Amazon AGI's SF lab shared customer stories showing Nova Act handling enterprise workflows such as complex form filling and long administrative processes.
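As an illustration of what a "profile" could look like in practice, here is a minimal sketch that records per-task results along those dimensions and aggregates them. The field names and the human-reference step ratio are illustrative assumptions, not the reporting format of any specific benchmark.

```python
# Sketch of reporting a profile per task instead of a single accuracy number.
# Fields mirror the dimensions above: success, steps vs. a human reference
# trajectory (OSWorld-Human style), wall-clock time, and dollar cost.
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str
    success: bool
    agent_steps: int
    human_steps: int          # from a human reference trajectory
    wall_clock_seconds: float
    api_cost_usd: float

    @property
    def step_ratio(self) -> float:
        """How many times more steps the agent took than the human baseline."""
        return self.agent_steps / max(self.human_steps, 1)


def summarize(results: list[TaskResult]) -> dict[str, float]:
    if not results:
        return {}
    n = len(results)
    return {
        "success_rate": sum(r.success for r in results) / n,
        "mean_step_ratio": sum(r.step_ratio for r in results) / n,
        "mean_latency_s": sum(r.wall_clock_seconds for r in results) / n,
        "cost_per_task_usd": sum(r.api_cost_usd for r in results) / n,
    }
```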
Where do the "brands" stand?

UI-TARS from ByteDance is a single screenshot-driven agent that reports numbers on ScreenSpot-Pro and OSWorld, spanning all three layers. H Company specializes in grounding and shows results on ScreenSpot-v2, ScreenSpot-Pro, GroundUI-Web, Showdown-Clicks, and its very own WebClick benchmark. AGI, Inc. focuses on the web and OS layers via their own REAL and the established OSWorld leaderboards. Theta concentrates on the OS and browser layers via CUB.

Benchmarks doubling as go-to-market channels

Many of these benchmarks also act as distribution and data engines. AGI, Inc. built REAL and then an SDK plus agents around it; being "#1 on REAL" is both a research claim and a funnel into their product. Theta's CUB is positioned as "Humanity's last exam for computer use agents." Halluminate uses Westworld and Web Bench as both benchmarks and infrastructure for running browser agents at scale. Benchmarks are becoming part measurement, part distribution, and part data flywheel. If you're picking which ones to invest in, you're also picking which ecosystems you want to plug into.

The shift from live sites to synthetic sandboxes

Many first-wave web benchmarks evaluated agents directly on live sites. Mind2Web, Online Mind2Web, and WebVoyager run tasks on real, changing webpages from over 100 popular sites. Web Bench likewise uses tasks on real websites such as Amazon, Apple, Google Flights, and hundreds of other high-traffic domains. This buys realism, but it makes evaluation fragile: sites change, DOMs drift, and reliable automatic reward signals are hard to maintain at scale.

The emerging alternative is high-fidelity synthetic environments with built-in programmatic rewards. WebArena provides a self-hosted "mini web" of fully functional sites (e-commerce, forums, project tools, a CMS) whose state is fully observable and reproducible. CUB positions itself as "Humanity's Last Exam for Computer and Browser Use Agents," highlighting the complexity of tasks that can be built in these realistic environments. REAL (from AGI, Inc.) builds deterministic replicas of 11 widely used websites and evaluates agents via programmatic state checks plus rubric-based judging. Halluminate's Westworld offers a "fully simulated internet" of browser environments for economically meaningful workflows, complementing their Web Bench benchmark on live sites; in fact, Halluminate's first benchmark, Web Bench, ran on live sites, and they moved to private synthetic sites with Westworld, their most recent benchmark. WARC-Bench goes further by recording dynamic, realistic web pages into interactive Web ARChive files with programmatic reward functions.

A simulated Amazon or flight-booking site can miss rare cases you'd see on the real web, and there is active interest in studying the "sim-to-real" gap, for example by comparing Westworld-style simulators with tasks on the real Google Flights. But in return, these sandboxes offer stable tasks, precise ground truth, and safe, massively parallel evaluation. Given this, the trajectory is clear: live-web benchmarks remain essential for checking real-world performance, but the center of gravity for day-to-day agent evaluation is moving toward realistic, instrumented sandboxes with explicit reward functions and full observability, especially as enterprise use cases shift toward private websites.
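To illustrate what "programmatic rewards" means in this context, here is a minimal sketch of a state-check reward, assuming the sandbox can dump its final state as structured data. The state shape, task spec, and function name are invented for illustration; each benchmark defines its own checks.

```python
# Sketch of a programmatic reward check in a synthetic sandbox, assuming the
# environment exposes its final state as structured data. The state shape and
# task spec below are invented for illustration.

def checkout_reward(final_state: dict, task: dict) -> float:
    """Return 1.0 if the agent bought the requested item under budget, else 0.0."""
    order = final_state.get("last_order")
    if order is None:
        return 0.0
    bought_right_item = any(
        item["product_id"] == task["product_id"] for item in order["items"]
    )
    under_budget = order["total_usd"] <= task["max_budget_usd"]
    return 1.0 if (bought_right_item and under_budget) else 0.0


# Example usage with a toy final state and task spec:
state = {"last_order": {"items": [{"product_id": "hotel-nyc-123"}], "total_usd": 238.0}}
task = {"product_id": "hotel-nyc-123", "max_budget_usd": 250.0}
assert checkout_reward(state, task) == 1.0
```

Because the ground truth is a state check rather than a judge looking at a live page, this kind of reward is stable, cheap to run in parallel, and immune to the site changing underneath the benchmark.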
How to use this if you're building an agent

If you're trying to ship an agent, here's a pragmatic checklist. For all evaluations, avoid creating custom harnesses optimized for a single benchmark. To ensure your results mean something beyond a launch announcement, use established public harnesses and document your implementation choices. Now onto the specific patterns per agent type:

If you're building a GUI-aware model: train on ScreenSpot + GroundUI + WebClick style data, then report on ScreenSpot-v2 / ScreenSpot-Pro / GroundUI-1K / WebClick / Showdown-Clicks, ideally via the ScreenSuite harness where applicable for standardization. You're optimizing for localization accuracy and robustness to varied UI skins.

If you're building a web agent: start with Mind2Web (offline) to validate basic behavior, then move to Online Mind2Web + REAL for live behavior and cost curves. Consider Web Bench (real web, wide coverage) and WebArena / Westworld (self-hosted, simulated but realistic environments) once you care about distribution shift and robustness. Your north star becomes: success rate and reliability and cost per task (see the sketch after this list).

If you're building a full "computer-use agent": use OSWorld-Verified as the standard capability check. Study OSWorld-Human to understand where you're much slower or more brittle than humans. If you're selling into enterprises, consider CUB and relevant vertical benchmarks like SCUBA.
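Here is a minimal sketch of that web-agent north star, assuming you run each task several times and log success and cost per run. The aggregation (success rate, all-runs reliability, cost per successful task) and the field names are my own illustration, not any leaderboard's definition.

```python
# Sketch of aggregating repeated runs into a "north star" triple:
# success rate, reliability (tasks solved on every run), cost per success.
from collections import defaultdict


def aggregate(runs: list[dict]) -> dict[str, float]:
    """Each run is {'task_id': str, 'success': bool, 'cost_usd': float}."""
    if not runs:
        return {}
    by_task: dict[str, list[dict]] = defaultdict(list)
    for run in runs:
        by_task[run["task_id"]].append(run)

    successes = sum(run["success"] for run in runs)
    solved_every_time = sum(
        all(r["success"] for r in task_runs) for task_runs in by_task.values()
    )
    total_cost = sum(run["cost_usd"] for run in runs)

    return {
        "success_rate": successes / len(runs),
        "reliability": solved_every_time / len(by_task),  # pass^k-style consistency
        "cost_per_success_usd": total_cost / max(successes, 1),
    }
```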
Benchmarks are maturing faster than agents, but they're still broken

A year ago, "computer-use" benchmarks were fragmented. Today we have a more complete benchmark stack: grounding benchmarks that stress-test vision models on every UI imaginable, web benchmarks spanning thousands of real sites, and OS benchmarks that replicate actual knowledge work. The best agents still struggle: low success rates on OSWorld, step counts 2× longer than humans', and costs that turn deployment into a CFO problem.

But there's a deeper issue. As Anderson showed, a large share of the performance gap on these benchmarks is scaffolding, not model quality. A 3B model with the right prompt can beat a 72B model with a naive one. The "everyone is SOTA on something" problem hasn't been solved; it has just moved from benchmark selection to harness engineering.

The chaos is starting to resolve around ScreenSpot / GroundUI for grounding, Mind2Web / REAL for web tasks, and OSWorld / CUB for full OS execution. But more importantly, people are catching on. When production deployments start, scaffolding tricks stop working. The benchmarks that survive will be the ones where performance actually predicts real-world behavior.

What matters now is rigor. Run the standard evals with public harnesses. The gap between benchmark performance and production reality is where all the actual work lives. The measurement infrastructure exists and will only get better. Scrutiny is coming and you should build for that world, not this one.

References

Layer 1 – Grounding
ScreenSpot – Original cross-platform GUI grounding benchmark (mobile, desktop, web). https://llm-stats.com/benchmarks/screenspot
ScreenSpot-v2 – Updated GUI grounding benchmark with cleaner labels and broader coverage. https://huggingface.co/datasets/Voxel51/ScreenSpot-v2
ScreenSpot-Pro – High-resolution professional GUI grounding benchmark (23 applications, 5 industries, 3 OSes). https://arxiv.org/abs/2504.07981
GroundUI / GroundUI-1K – Multi-platform dataset (web / desktop / mobile) with a 1K eval subset. https://huggingface.co/datasets/agent-studio/GroundUI-1K
Showdown-Clicks – 5,679 human clicks from macOS desktop tasks for click prediction and low-level control. https://huggingface.co/datasets/generalagents/showdown-clicks
WebClick – 1,600+ web screenshots with "click here" labels; H Company's benchmark for web localizers. https://huggingface.co/datasets/Hcompany/WebClick
ScreenSuite – Hugging Face's umbrella GUI-agent benchmarking harness covering perception + single/multi-step tasks. https://github.com/huggingface/screensuite

Layer 2 – Web-based agents
Mind2Web (offline) – 2,350 tasks across 137 real websites and 31 domains with action sequences. https://osu-nlp-group.github.io/Mind2Web/
Online Mind2Web – 300 tasks on 136 live websites; public leaderboard for web agents on real sites. https://huggingface.co/spaces/osunlp/Online_Mind2Web_Leaderboard
Mind2Web 2 – 130 long-horizon, real-time browsing tasks with an Agent-as-a-Judge framework. https://osu-nlp-group.github.io/Mind2Web-2/
WebArena – Self-hosted "mini-web" of realistic mock sites with a functional task-completion benchmark. https://webarena.dev/
REAL Bench (REAL) – AGI, Inc.'s "mini-Internet" of replicated major sites with programmatic rewards and rubric-based judging. Blog post: https://www.theagi.company/blog/introducing-real-bench Leaderboard / evals: https://www.realevals.xyz
Web Bench – 5,750 tasks across 452 high-traffic live sites; Halluminate's large-scale browser agent benchmark. GitHub: https://github.com/Halluminate/WebBench
Westworld – Suite of highly realistic browser simulators with verifiable rewards for web-agent benchmarking. Blog post: https://halluminate.ai/blog/westworld
WebVoyager – Task benchmark on live, dynamic websites for end-to-end web browsing agents. https://arxiv.org/abs/2401.13919
WARC-Bench – Web-archive-based benchmark of 438 GUI subtasks on dynamic, realistic archived web pages (via Web ARChive files). https://arxiv.org/abs/2510.09872

Layer 3 – Full / multi-app computer use
OSWorld – 369 multimodal computer-use tasks across real Ubuntu / Windows / macOS applications and file I/O. Site: https://os-world.github.io
OSWorld-Human / OSWorld-Verified – Efficiency-focused extensions with human trajectories and a clean harness. OSWorld-Human: https://mlsys.wuklab.io/posts/oshuman/
CUB (Computer Use Benchmark) – Theta's cross-vertical benchmark for long-horizon desktop + browser workflows ("Humanity's Last Exam for Computer and Browser Use Agents"). Blog post: https://thetasoftware.com/blog/introducing-cub/ Announcement: https://x.com/trytheta/status/1923169553497866568
SCUBA (Salesforce Computer Use Benchmark) – ~300 Salesforce CRM workflows across admin / sales / service personas in sandboxed environments. https://sfrcua.github.io/SCUBA/

Cross-layer / general agent benchmarks mentioned
GAIA – Benchmark for general AI assistants (450 real-world questions across three difficulty levels requiring tools, browsing, and multimodal reasoning). https://arxiv.org/abs/2311.12983
Ben Anderson's blog post "Computer-Use Evals are a Mess" – https://benanderson.work/blog/computer-use-benchmarks/

Disclaimer: I am currently working at Theta