The Missing Data Problem Behind Broken Computer-Use Agents

Written by aimodels44 | Published 2026/04/03
Tech Story Tags: ai | software-architecture | software-development | infrastructure | data-science | performance | design | computer-use-agents

TL;DR: Sparse screenshots miss the motion, recovery, and reasoning computer-use agents need to navigate professional desktop software effectively.

Why computer-use agents fail at professional software

Computer-use agents should be useful by now. We have foundation models that understand images and reason about text. We have the infrastructure to make them click buttons and type. Yet when deployed on real professional desktop applications, they fail roughly 60% of the time. The gap between capability and performance suggests something fundamental is missing from their training.

The bottleneck isn't intelligence. It's data, specifically the kind of data we've been using to train these agents. The largest open dataset available, ScaleCUA, contains 2 million screenshots. Spread across hours of interaction, that's sparse by any measure. You could capture that many screenshots from 20 hours of real desktop work. We've been training agents on a photo album when we need film.

But it's not just a quantity problem. Sparse screenshots actively hide information crucial for learning. When you show only isolated moments, an agent never learns why a human moved the cursor in certain ways, how they recovered from mistakes, or what micro-decisions separated success from failure. You've reduced the learning signal from a rich behavioral stream to disconnected keyframes.

CUA-Suite addresses this bottleneck by providing exactly what we've been missing: 55 hours of continuous, human-demonstrated professional computer use. The core insight is deceptively simple. Computer-use agents need to learn from video, not just from static images, because temporal continuity is where the real knowledge lives.

The information hidden in motion

Recent work emphasizes that continuous video is the critical missing ingredient for scaling these agents. But understanding why requires looking at what video teaches that screenshots cannot.

When you train on isolated image-action pairs, you're treating every click as independent. But humans don't work that way. The cursor doesn't teleport between positions; it follows a path. That path communicates intent. Slow, steady movements signal careful deliberation. Sharp, direct diagonals suggest well-practiced routines. Fast corrections indicate error recovery.

Consider resizing a dialog box. The critical signal is the continuous mouse trajectory while dragging. Static frames give you "mouse at position A, then at position B." Video shows velocity profile, acceleration, micro-corrections, and the exact moment resistance appears. These details encode knowledge about what friction means, when to pause, when to commit.
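To make the "position A, then position B" loss concrete, here is a minimal sketch with an invented cursor trace (not CUA-Suite data): two keyframes preserve only the displacement between endpoints, while the full 30 fps trace also preserves the path length, and the gap between the two reveals the overshoot-and-correct motion.

```python
import math

# Hypothetical cursor trace sampled at 30 fps: one (x, y) per frame.
# This drag overshoots the target and corrects back at the end.
trace = [(100, 100), (140, 102), (185, 104), (230, 103), (236, 101), (228, 100)]

# What two static keyframes preserve: start and end positions only.
start, end = trace[0], trace[-1]
displacement = math.dist(start, end)

# What the continuous trace additionally preserves: the actual path,
# including the overshoot-and-correct segment near the end.
path_length = sum(math.dist(a, b) for a, b in zip(trace, trace[1:]))

# A ratio above 1 signals micro-corrections invisible to sparse keyframes.
print(f"displacement={displacement:.1f}, path={path_length:.1f}, "
      f"ratio={path_length / displacement:.2f}")
```

The same per-frame coordinates also support velocity and acceleration estimates, which is exactly the signal the static frames discard.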

Beyond cursor dynamics, sparse data fails to capture failure and recovery entirely. Humans don't always succeed on the first attempt. They overshoot, undo, backtrack, retry. Each failure contains information about what went wrong and how to recognize it. Sparse data deletes these sequences.

Professional software adds another layer of invisibility. A single screenshot from Krita, FreeCAD, or Inkscape is visually dense. Without video context showing how humans navigated to that state, an agent is lost. Where to look first? What sequence of UI elements led here? Video answers these questions implicitly through the flow of attention.

Continuous video is a strict superset of information. Every useful screenshot can be extracted as a keyframe from video without loss. But video additionally captures temporal sequences showing which UI elements humans examine first, cursor dynamics revealing confidence and expertise, micro-interactions happening too fast for stills to capture, failure and recovery patterns, and the rhythm of professional work itself.

This creates an elegant property: you can always extract sparse screenshots or click sequences from continuous video without information loss. But the reverse is impossible. A video dataset remains forward-compatible with any future agent architecture or learning approach. You're building the most general resource possible.
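The superset property is easy to demonstrate in a few lines. The frame indices and click timestamps below are invented placeholders, not the dataset's actual format; the point is only that both a uniform screenshot stream and click-aligned image-action pairs are derivable views of one dense recording.

```python
# Sketch of the "strict superset" claim: given a dense frame stream,
# sparser views can be derived from it without loss. All values here
# are hypothetical, not the CUA-Suite format.
fps = 30
frames = list(range(fps * 10))   # 10 seconds of frame indices
click_times = [1.2, 4.5, 8.9]    # seconds at which clicks occurred

# Derived view 1: one screenshot per second (uniform downsampling).
per_second = frames[::fps]

# Derived view 2: the screenshot at each click (image-action pairs).
at_clicks = [frames[round(t * fps)] for t in click_times]

print(per_second)   # every 30th frame index: 0, 30, ..., 270
print(at_clicks)    # [36, 135, 267]
```

Going the other direction is impossible: nothing in `per_second` or `at_clicks` recovers the frames between samples.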

Inside CUA-Suite

The dataset is organized around three complementary resources, each answering different research questions.

VideoCUA is the core. It contains approximately 10,000 human-demonstrated tasks across 87 diverse applications with continuous 30 fps screen recordings, totaling 55 hours and 6 million frames. But raw video alone doesn't train agents. The dataset includes dense annotations that transform video into training material.
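The headline numbers are internally consistent, which is worth a quick check: 55 hours at 30 fps lands almost exactly on the quoted 6 million frames, and dividing the total duration by the task count gives the implied average demonstration length.

```python
# Sanity check on the stated VideoCUA scale.
hours, fps, tasks = 55, 30, 10_000

frames = hours * 3600 * fps
print(frames)            # 5,940,000 -- consistent with "~6 million frames"

seconds_per_task = hours * 3600 / tasks
print(seconds_per_task)  # ~19.8 seconds per demonstrated task on average
```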

Kinematic cursor traces provide the exact pixel coordinates of the cursor at every frame, along with velocity and acceleration vectors. This is how you teach agents to move like humans, not like mechanical robots with unlimited precision.
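Per-frame coordinates make velocity and acceleration vectors recoverable by finite differences. The trace below is hypothetical and the dataset's actual trace format may differ; this only sketches the derivation.

```python
# Deriving velocity and acceleration vectors from per-frame cursor
# coordinates via finite differences. Trace values are hypothetical.
fps = 30
xs = [(0.0, 0.0), (3.0, 1.0), (9.0, 3.0), (18.0, 6.0)]  # (x, y) per frame

# Velocity: first difference of position, scaled to pixels per second.
vel = [((x1 - x0) * fps, (y1 - y0) * fps)
       for (x0, y0), (x1, y1) in zip(xs, xs[1:])]

# Acceleration: first difference of velocity, scaled the same way.
acc = [((vx1 - vx0) * fps, (vy1 - vy0) * fps)
       for (vx0, vy0), (vx1, vy1) in zip(vel, vel[1:])]

print(vel)  # [(90.0, 30.0), (180.0, 60.0), (270.0, 90.0)]
print(acc)  # [(2700.0, 900.0), (2700.0, 900.0)] -- a steady speed-up
```

A mechanically precise agent would show flat velocity and zero acceleration; human traces show ramps, peaks, and corrective jitter like the speed-up above.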

Keyframe annotations mark the visually and semantically distinct moments within task videos. Rather than labeling every frame, human experts identified transition points between major steps, decision points, and state changes. This creates a natural hierarchy from dense video down to sparse keyframe structure, letting researchers work at whatever granularity their approach requires.
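CUA-Suite's keyframes were chosen by human experts, but a simple automated heuristic illustrates the dense-video-to-sparse-keyframe hierarchy the text describes: keep a frame whenever the screen has changed substantially since the last kept frame.

```python
# Illustrative keyframe selection (not the dataset's actual method):
# keep frames that differ from the previous kept frame by more than
# a threshold, approximating "transition points between major steps".

def select_keyframes(frames, threshold):
    """frames: list of equal-length pixel vectors; returns kept indices."""
    kept = [0]                      # always keep the first frame
    for i in range(1, len(frames)):
        ref = frames[kept[-1]]
        diff = sum(abs(a - b) for a, b in zip(frames[i], ref))
        if diff > threshold:        # large visual change: a new "step"
            kept.append(i)
    return kept

# Toy 4-pixel "screens": a dialog opens at frame 2 and closes at frame 4.
frames = [[0, 0, 0, 0], [0, 0, 0, 1], [9, 9, 9, 9], [9, 9, 9, 8], [0, 0, 0, 0]]
print(select_keyframes(frames, threshold=10))  # [0, 2, 4]
```

Researchers who want sparser data can train on just these indices; those who want dense data keep every frame.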

Multi-layer reasoning annotations set CUA-Suite apart. For many tasks, annotators included explanations of why they took specific actions: "I'm clicking the dropdown to see options," "I'm hovering to check menu expansion," "I made an error and need to undo." These annotations bridge low-level actions and high-level reasoning, letting agents learn not just what humans do but why they do it.
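A record pairing an action with its stated rationale might look like the sketch below. The field names are invented for illustration, not the dataset's actual schema; the rationales quote the examples above.

```python
# A guessed-at record shape for one multi-layer annotated step, to make
# the action-plus-reasoning pairing concrete. Field names are invented.
from dataclasses import dataclass

@dataclass
class AnnotatedStep:
    frame_index: int          # where in the video this step occurs
    action: str               # low-level event, e.g. "click", "hover", "key"
    target: tuple[int, int]   # pixel coordinates of the action
    rationale: str            # annotator's stated reason for the action

steps = [
    AnnotatedStep(412, "click", (830, 112), "I'm clicking the dropdown to see options"),
    AnnotatedStep(460, "hover", (830, 160), "I'm hovering to check menu expansion"),
    AnnotatedStep(505, "key",   (0, 0),     "I made an error and need to undo"),
]

# Agents can train on (state, action) pairs alone, or include the
# rationale field when reasoning supervision is wanted.
for s in steps:
    print(s.action, "->", s.rationale)
```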

The diversity of applications is crucial. The dataset spans professional design software (Krita, Inkscape, FreeCAD), media production (OBS Studio), IDEs, office suites, and more. Agents trained on limited software learn assumptions about UI conventions that don't generalize. A web browser trains you to expect clickable elements in specific styles, scrollbars on the right, navigation at the top. FreeCAD's tree-based selection breaks these assumptions. Krita's multi-panel layout and context-sensitive tools require rethinking how to find and activate functions. This diversity forces agents to learn fundamental GUI principles rather than superficial patterns.


CUA-Suite overview showing how human GUI trajectories are recorded across desktop platforms, verified by experts, and annotated with keyframes, bounding boxes, and interaction logs

UI-Vision is the evaluation benchmark. You can't improve what you can't measure. This benchmark is specifically designed to test spatial understanding and planning capabilities, using different applications and tasks than VideoCUA so evaluation is genuinely out-of-distribution. It includes complex multi-step tasks requiring foresight, adversarial examples designed to catch common failure modes, and evaluation metrics that go beyond binary success/failure to measure partial progress and directional understanding.
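One way to score beyond binary success/failure is to credit the fraction of ordered subgoals the agent reached. The subgoal formulation below is my illustration of a graded metric, not UI-Vision's actual definition.

```python
# Sketch of a graded (non-binary) task metric: credit for the fraction
# of required subgoals reached, in order. Illustrative only.

def partial_progress(required, achieved):
    """required: ordered subgoal names; achieved: set of subgoals hit."""
    done = 0
    for goal in required:
        if goal in achieved:
            done += 1
        else:
            break   # later subgoals don't count without the earlier ones
    return done / len(required)

required = ["open_dialog", "set_width", "set_height", "confirm"]
print(partial_progress(required, {"open_dialog", "set_width"}))  # 0.5
print(partial_progress(required, {"set_width", "confirm"}))      # 0.0
```

The second call shows why ordering matters: hitting later subgoals without the prerequisite steps indicates no real directional understanding.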

GroundCUA addresses the fundamental subproblem of visual grounding. Can the agent even see the UI correctly? This dataset contains 56,000 annotated screenshots with over 3.6 million bounding boxes around UI elements, each labeled semantically as button, text field, slider, menu item, etc. Current vision models trained on natural images misidentify what's clickable in professional software. A button in Krita looks nothing like a button in FreeCAD, yet both are actionable. GroundCUA teaches agents to recognize interactive elements across diverse software styles. This separation lets agents first ground themselves in the UI, then reason about what actions to take.
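A grounding evaluation in this spirit typically asks whether a predicted box overlaps an annotated element of the same class. Here is a minimal sketch with invented boxes, using the standard intersection-over-union criterion; the threshold and data format are assumptions, not GroundCUA's specification.

```python
# Minimal grounding check: does a predicted box match an annotated UI
# element of the right semantic class? Boxes are (x1, y1, x2, y2).

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

annotation = {"box": (10, 10, 50, 30), "label": "button"}   # invented
prediction = {"box": (12, 11, 52, 31), "label": "button"}   # invented

# A hit requires both the right class and sufficient overlap.
hit = (prediction["label"] == annotation["label"]
       and iou(prediction["box"], annotation["box"]) >= 0.5)
print(hit)  # True
```

Separating this check from task execution is what lets agents first ground themselves in the UI, then reason about actions.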

Why professional software breaks current agents

Current foundation models fail roughly 60% of tasks on professional desktop applications. The gap isn't stupidity; it's unfamiliarity with how professional software works.

Professional applications violate several assumptions that agents internalize from web browsing. In web interfaces, most actions are visible: buttons are on screen, menus appear when clicked. Professional software buries functionality behind hierarchical menus, tool panels, context-dependent toolbars, and mode switches. A button you need might not exist until you select a different tool or change the document mode. Agents trained to expect web-style discoverability struggle when it vanishes.

Web interfaces also maintain consistent interaction patterns. Clicking activates, dragging moves, scrolling navigates lists. Professional software uses mode switches where the same gesture has completely different meanings depending on context. In FreeCAD's tree, clicking selects an object. In the 3D viewport, clicking rotates the view. Agents see inconsistency and fail.

On the web, visual salience usually equals importance. Large, colorful buttons matter. Professional software inverts this: a tiny toolbar icon might be more important than large explanatory panels. Agents learn to attend to visually prominent regions, but professional software requires understanding functional hierarchy.

The evaluation results reveal specific, systematic failures. In Krita, agents become confused about which UI element represents which function, clicking on color swatches thinking they're brushes and vice versa. They lack semantic understanding connecting specific elements to specific roles. In FreeCAD, they apply toolbar operations without first selecting objects in the tree, missing the semantic dependency that selection must precede modification. In Inkscape, they navigate menus slowly when toolbar clicks would suffice, never internalizing that these paths are functionally equivalent but have different costs. In OBS Studio, the overwhelming number of clickable elements causes confusion, with agents misunderstanding which parameters affect which broadcast settings.


Krita: agents confuse similar-looking interactive elements across different panels


FreeCAD: agents apply toolbar operations without first selecting objects, skipping the required selection step


Inkscape: agents fail to recognize functionally equivalent UI paths and choose inefficient ones


OBS Studio: the abundance of similar elements across overlapping panels causes systematic parameter misidentification

Each failure has a common thread: the agent hasn't learned the semantic structure of the interface. It sees clickable elements but doesn't understand their functional roles, how they relate to each other, or which operations depend on prior selections. Raw visual matching fails when interfaces are unfamiliar.

What becomes possible now

CUA-Suite's rich multimodal corpus supports emerging research directions beyond just training better agents. The continuous video data enables research into generalist screen parsing, understanding and categorizing UI elements regardless of application. Video-based reward modeling becomes tractable when you have thousands of hours of expert demonstration showing which actions are good and which are wasteful. The dense frame data supports research into visual world models that predict how screens change in response to actions.

The forward-compatible design of continuous video means this dataset will remain useful as agent architectures evolve. The grounding dataset addresses a separate but crucial problem: making sure agents understand what they're looking at before they reason about what to do. The benchmark provides a rigorous way to measure progress on genuinely difficult, out-of-distribution tasks.

This combination addresses what earlier work on accessibility-aware computer-use identified as crucial gaps in how agents interact with diverse user interfaces. By providing diverse, densely annotated professional software demonstrations, CUA-Suite gives agents the raw material to develop robust understanding rather than superficial pattern matching.

The bottleneck for scaling computer-use agents was real. We were trying to teach machines how to use computers from sparse, disconnected examples. CUA-Suite provides what was actually missing: continuous records of how humans navigate professional software, annotated deeply enough to extract both immediate behavioral signals and the reasoning behind them. The 60% failure rate on professional applications suggests there's substantial room for improvement as agents train on this richer data. But more importantly, CUA-Suite establishes the standard for what good training data for computer-use agents actually looks like.


This is a Plain English Papers summary of a research paper called CUA-Suite: Massive Human-annotated Video Demonstrations for Computer-Use Agents. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.



Published by HackerNoon on 2026/04/03