Interaction Collapse Is RL’s Quiet Failure Mode

Written by aimodels44 | Published 2026/03/05
Tech Story Tags: ai | pyvision-rl | agentic-vision-models | reinforcement-learning | interaction-collapse | tool-use-incentives | multi-step-reasoning | trajectory-ranking

TL;DR: Agents trained with vision RL often “learn to quit.” PyVision-RL fixes interaction collapse by rewarding tool use step-by-step, not just outcomes.

Why agents keep giving up

Reinforcement learning for vision and video models seems straightforward in theory. Train a system to use tools, reward it when it solves tasks, and it should learn to reason through complex visual understanding problems. But something unexpected happens in practice. Models learn to stop trying.

This phenomenon, called interaction collapse, emerges quietly. Instead of using vision tools to examine different regions of an image or sampling strategic frames from a video, agents default to direct answers with minimal reasoning chains. They skip the scaffolding. They avoid multi-turn interactions. The reward signal pushes them toward solving tasks the cheapest way possible, which often means abandoning the very tools and reasoning patterns we're trying to instill.

The consequences are cascading. When agents refuse to reason through problems step-by-step, they need to process entire visual contexts in one go, exploding computational costs. A model that collapses into lazy reasoning also collapses your ability to scale the system. PyVision-RL tackles this by reframing how we reward interaction itself, making persistence pay off in ways that push against the natural tendency to take shortcuts.

The rational appeal of shortcuts

Current reinforcement learning approaches for multimodal models typically structure rewards around final outcomes: full reward for correct answers, zero for intermediate steps. This creates an incentive structure that silently punishes thoughtfulness. If a model can guess correctly without invoking tools, it saves computational overhead and maximizes its score. From an optimization perspective, the model is behaving rationally.
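A toy calculation makes the incentive problem concrete. The numbers below are illustrative, not from the paper: under an outcome-only reward, a quick guess and a careful multi-step solve score identically, so once any per-step compute cost enters the picture, the shortcut is strictly better.

```python
# Toy illustration (hypothetical numbers, not the paper's): under
# outcome-only reward, both strategies earn the same reward, so the
# strategy with fewer tool calls has higher net value.

def outcome_only_reward(correct: bool) -> float:
    """Binary reward: full credit for a correct answer, nothing else."""
    return 1.0 if correct else 0.0

COST_PER_TOOL_CALL = 0.05  # hypothetical per-call compute penalty

# Assume both strategies reach the right answer, but the shortcut
# skips three tool calls.
shortcut_value = outcome_only_reward(True) - 0 * COST_PER_TOOL_CALL
careful_value = outcome_only_reward(True) - 3 * COST_PER_TOOL_CALL

# The shortcut wins, so the model "rationally" stops using tools.
assert shortcut_value > careful_value
```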

The problem compounds because most large vision-language models come pre-trained with strong instincts to answer directly. They've seen billions of image-text pairs in pretraining where the task was to caption or describe without external tools. Fine-tuning them with weak incentives for tool use is like trying to redirect an already-trained system through marginal pressure. The pre-training weights tug harder than the new signal.

Some researchers have tried workarounds. Oversampling examples of multi-step reasoning in training data increases their relative frequency. Filtering training trajectories to emphasize cases where tool use helped ensures models see positive examples. These approaches improve performance, but they're addressing symptoms rather than root causes. They increase how often the model sees good tool-use patterns, but they don't change the underlying reward landscape that makes shortcuts optimal in the first place.

Redesigning what gets rewarded

The breakthrough in PyVision-RL is the recognition that interaction collapse is rational reward optimization, so the fix has to change the reward itself: make tool use cumulatively rewarding. Not as a one-time bonus at the end, but as something that compounds at each step.

The framework combines three components that work together. First, it oversamples longer reasoning chains during rollout collection. Rather than treating all trajectories equally, the training process deliberately weights longer interaction sequences more heavily. Second, it filters out trajectories where tool use clearly harmed performance, preventing the model from learning that all tool invocation is good. Third, it ranks remaining trajectories by quality, so within the set of multi-turn reasoning examples, the model learns from the best ones.

This curation shapes what the model learns to imitate. You're not showing it random examples of what happened during exploration. You're deliberately designing a training diet of "trajectories where the agent kept trying and it paid off." The signal becomes: persistent reasoning works.
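The curation step above can be sketched as a small filter-then-rank routine. This is a simplified reading of the paper's strategy, and the trajectory fields (`num_tool_calls`, `correct`, `score`) are illustrative names, not the actual data structures:

```python
# Sketch of rollout curation: drop trajectories where tool use clearly
# hurt, then rank the rest so longer, higher-quality reasoning chains
# dominate the training batch. A simplification of the strategy
# described in the text, with hypothetical field names.
from dataclasses import dataclass

@dataclass
class Trajectory:
    num_tool_calls: int
    correct: bool
    score: float  # scalar quality estimate (e.g. from a heuristic)

def curate(rollouts: list[Trajectory], keep: int) -> list[Trajectory]:
    # 1) Filter: tools were invoked but the answer still came out
    #    wrong, so this trajectory would teach "tool use is harmless
    #    noise" at best and "tool use hurts" at worst.
    kept = [t for t in rollouts
            if not (t.num_tool_calls > 0 and not t.correct)]
    # 2) Rank: prefer higher quality, breaking ties toward longer
    #    interaction chains, and keep only the top slice.
    kept.sort(key=lambda t: (t.score, t.num_tool_calls), reverse=True)
    return kept[:keep]
```

Sorting on `(score, num_tool_calls)` is one simple way to bias the kept set toward sustained multi-turn reasoning; the real pipeline may weight chain length differently.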

The crucial innovation sits in the reward function itself. Rather than binary signals, PyVision-RL assigns partial rewards for each tool invocation. Every time the model calls a tool during reasoning, it receives explicit value. Task completion gets additional reward on top. This flips the incentive structure completely. Multi-step reasoning is now strictly higher-reward than shortcuts. A correct answer achieved through three tool calls pays more than a correct answer via one call, because each step in the reasoning chain receives value.

This reshapes what the optimal policy looks like. The model no longer faces a choice between "quick guess" and "careful reasoning" where both lead to similar rewards. Now careful reasoning is incentivized directly. The computational cost of using tools is offset by explicit reward signal. The model discovers that the highest-reward strategies involve sustained engagement with the problem.
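The flipped incentive can be written down in a few lines. The constants below are placeholders, not the paper's tuned values; the point is only the shape of the function, where each tool call adds reward on top of the terminal bonus:

```python
# Minimal sketch of a cumulative, step-wise reward: a fixed bonus per
# tool invocation plus a terminal bonus for task completion. The
# constants are hypothetical placeholders.

TOOL_CALL_BONUS = 0.1   # partial reward per tool invocation
COMPLETION_BONUS = 1.0  # terminal reward for a correct answer

def cumulative_reward(num_tool_calls: int, correct: bool) -> float:
    step_reward = TOOL_CALL_BONUS * num_tool_calls
    return step_reward + (COMPLETION_BONUS if correct else 0.0)

# A correct answer reached via three tool calls now strictly outscores
# a correct one-call shortcut, reversing the earlier incentive.
assert cumulative_reward(3, True) > cumulative_reward(1, True)
```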

Making reasoning tractable at scale

But solving interaction collapse creates a new problem. If the model learns to keep using tools, won't that explode your computational budget?

Video understanding is expensive by nature. A video with many frames at reasonable resolution, converted to visual tokens, consumes enormous context. Traditional approaches force a choice: process everything upfront, which is computationally heavy, or provide fixed visual context, which limits the reasoning process to what you preloaded. If you train a model to use tools intelligently but those tools can only access static, pre-processed context, you haven't actually solved anything.

PyVision-Video introduces on-demand context construction. Rather than pre-processing an entire video into tokens, the system lets the reasoning process determine which frames to examine. When the model's reasoning indicates it needs visual information about a specific moment, that frame is loaded and encoded into tokens at that point. The video is always available, but tokens are only created for frames the reasoning process deems relevant.
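A lazy token cache captures the core of this idea. The sketch below assumes a `decode_frame` function that returns a raw frame by index and an `encode_frame` function that turns a frame into visual tokens; both are stand-ins for whatever the real system uses:

```python
# Sketch of on-demand context construction: visual tokens are created
# only for frames the reasoning process actually requests, and cached
# so repeated requests are free. decode_frame / encode_frame are
# hypothetical stand-ins for the real decoder and vision encoder.

class OnDemandVideoContext:
    def __init__(self, decode_frame, encode_frame):
        self._decode = decode_frame   # frame_index -> raw frame
        self._encode = encode_frame   # raw frame -> list of tokens
        self._token_cache = {}        # frame_index -> tokens

    def tokens_for(self, frame_index: int):
        """Encode a frame only when the reasoning loop asks for it."""
        if frame_index not in self._token_cache:
            frame = self._decode(frame_index)
            self._token_cache[frame_index] = self._encode(frame)
        return self._token_cache[frame_index]

    def tokens_materialized(self) -> int:
        """Total visual tokens actually created so far."""
        return sum(len(t) for t in self._token_cache.values())
```

In this scheme the token budget scales with the number of frames the model chooses to inspect, not with video length, which is the property the efficiency argument below depends on.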

This is efficient because most video reasoning doesn't require every frame. It requires the right frames. A model trained with cumulative tool rewards naturally learns to request relevant information because doing so is rewarded. The incentive structure and the architecture align: the model's learned behavior of using tools intelligently becomes practical because tools (frame requests) are now tractable.

The efficiency gains are substantial. By selectively sampling task-relevant frames during reasoning, visual token usage drops significantly compared to systems that process entire videos in one pass. For long-form video understanding, this isn't marginal improvement; it's transformative. You can handle much longer sequences within the same computational budget.

The complete training framework

PyVision-RL uses a unified training pipeline for both image understanding and video understanding. The core loop is consistent: collect rollouts where the model interacts with visual content, generate multiple trajectory samples, filter and rank them based on the strategy described above, then train using the redesigned cumulative reward signal. The image version (PyVision-Image) and video version (PyVision-Video) share this framework, with the primary difference being that video leverages on-demand frame construction.
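The loop described above can be sketched as follows. `collect_rollouts`, `curate`, and `update_policy` are placeholders for the actual implementation, which the summary does not spell out; only the overall control flow is taken from the text:

```python
# Sketch of the unified training loop: collect rollouts, filter and
# rank them, then optimize against the cumulative reward signal.
# The three callables are hypothetical stand-ins for the real stages.

def training_step(policy, tasks, collect_rollouts, curate, update_policy,
                  samples_per_task=8):
    batch = []
    for task in tasks:
        # Multiple trajectory samples per task during rollout collection.
        rollouts = collect_rollouts(policy, task, n=samples_per_task)
        # Filter harmful tool use and rank the rest by quality.
        batch.extend(curate(rollouts))
    # One optimization step on the curated, cumulatively-rewarded batch.
    return update_policy(policy, batch)
```

Because the loop is modality-agnostic, the same skeleton serves PyVision-Image and PyVision-Video; only the tool set and context construction differ.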

Unifying the pipeline matters conceptually and practically. It means insights from training one modality inform the other. The approach isn't hand-tuned to videos or images. The core insights about interaction collapse and cumulative rewards are architectural principles that transfer.

A persistent risk with reinforcement learning systems is degenerate solutions, where the model finds shortcuts that technically maximize reward but lack generality. The filtering step helps prevent this by removing trajectories where tool use clearly harmed downstream performance. Equally important is that the cumulative reward structure creates a broad plateau of good solutions rather than a sharp peak. Multiple reasoning paths can be similarly rewarded if they involve thoughtful tool use, which stabilizes training and prevents the system from overfitting to a single solution pattern.

Related work on improving visual extraction capabilities has similarly emphasized the importance of intermediate steps in vision reasoning, though without the specific focus on preventing interaction collapse through reward design. The perspective here extends those findings by directly addressing why models would abandon such steps in the first place.

Implications for open and scalable AI

The paper's focus on open-weight models isn't incidental. Proprietary systems often have access to expensive training procedures and massive compute budgets. Open models need to work within realistic constraints. PyVision-RL is designed around this reality: it improves efficiency through on-demand processing and doesn't require prohibitive compute budgets for rollout collection or training.

This matters for scaling. The demonstrations show strong performance on various benchmarks, but more importantly, they show that reasoning depth doesn't require proportional increases in compute. Agents that reason more thoroughly don't necessarily consume more tokens. That's the practical result: you can build more capable systems without hitting hard efficiency walls.

The approach connects to broader research on dynamic tooling in agentic vision systems, which similarly emphasizes letting models make decisions about what visual information to process. PyVision-RL provides concrete mechanisms for training such systems to actually make good decisions about tool use rather than abandoning tools entirely.

Looking forward, the framework opens questions about how far you can extend it. Can these methods work on genuinely long-horizon reasoning tasks that require sustained interaction over dozens of steps? What happens if you combine this with other efficiency techniques? Does the approach generalize beyond vision to other modalities? The foundation is now stable enough that future work can push in different directions without the foundational blocker of interaction collapse.

The deepest insight doesn't live in the technical details. Training AI systems is not only a matter of data and optimization algorithms; it is also a matter of incentive structure. By redesigning what gets rewarded, you change what agents learn to value. When you make persistence rewarding and reasoning efficient, agents become more capable and more genuinely agentic. That principle extends far beyond multimodal models: it's a lesson about aligning what you're training toward what you actually want.


This is a Plain English Papers summary of a research paper called PyVision-RL: Forging Open Agentic Vision Models via RL. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.



Published by HackerNoon on 2026/03/05