The feedback problem: why standard reinforcement learning struggles
Language models learn from feedback the way a student learns from a failed exam: they see the grade but struggle to figure out what to actually do differently. Reinforcement learning has become the dominant approach for training language models to improve on complex tasks. Models generate outputs, receive environmental feedback (did the code run? did the reasoning reach the right conclusion?), and adjust their behavior accordingly. This works beautifully when feedback is rich and immediate. But in the real world, feedback is almost always sparse and delayed.
Consider a multi-step reasoning task. The model reasons through five steps and arrives at the wrong answer. It receives a single bit of information: failure. From this thin signal, it must somehow figure out which of its earlier decisions caused the problem. Was it the first step? The third? Did the model misinterpret the question? Each component in the chain could be the culprit. Standard reinforcement learning (often called RLVR, for reinforcement learning with verifiable rewards) handles this by repeating the trial-and-error loop: generate attempt, get feedback, adjust policy, repeat. But here's the limitation: each cycle treats feedback as a scalar signal. Without explicit reasoning about why something failed, the model engages in undirected exploration, often reverting previous gains or heading down dead-end paths.
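To make that contrast concrete, here is a minimal sketch of the scalar-feedback loop. The names `generate_attempt`, `verify`, and `update_policy` are hypothetical placeholders for the model, the environment check, and the policy update, not any particular library's API. The only thing the update ever sees is a single number.

```python
# Minimal sketch of the standard scalar-feedback (RLVR-style) training step.
# The callables passed in are hypothetical placeholders, not a real API.

def rlvr_step(task, generate_attempt, verify, update_policy):
    attempt = generate_attempt(task)
    reward = verify(task, attempt)        # a single scalar: pass or fail
    update_policy(task, attempt, reward)  # adjusts on that scalar alone;
    return reward                         # nothing here says *why* it failed
```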
Conceptual comparison of learning dynamics in RLVR and Experiential Reinforcement Learning (ERL). RLVR relies on repeated trial-and-error driven by scalar rewards, leading to back-and-forth oscillation without durable correction; ERL follows a structured experience-reflection-consolidation loop that leads to stable improvement.
This is where the tension lives. We've thrown increasingly powerful models at this problem, but we haven't solved the fundamental issue: models are being asked to extract structured behavioral lessons from unstructured scalar feedback. The result is inefficient learning, unstable optimization, and poor final performance on complex tasks.
The reflection insight
What if we forced the model to do what humans naturally do after failure: pause and reflect? When you fail at something and then take time to think about why you failed, something fundamental shifts. You're not just receiving feedback; you're generating an explanation. That explanation becomes a bridge between the raw failure signal and future action. You've converted passive observation into active understanding.
Experiential Reinforcement Learning (ERL) introduces an explicit reflection step into the training loop. After receiving feedback on an initial attempt, the model generates a natural language reflection that explains what happened and why. This reflection then becomes the context for a second attempt. The critical move: the success of this second attempt, informed by the model's own reasoning, is what gets reinforced back into the base policy.
Think of it as self-generated intermediate supervision. The sparse environmental feedback doesn't directly teach the policy. Instead, it triggers the model to generate rich, structured reasoning that guides a refined attempt. The RL signal then learns to value the refined attempt and, by extension, learns to generate useful reflections. This resolves a deep tension in the feedback problem. Instead of fighting sparse feedback, we're converting it into dense, task-relevant reasoning. The model doesn't just learn "you failed." It learns "you failed because [reason], so next time try [approach]." And crucially, the model is generating this reasoning itself, so it's naturally aligned with how the model thinks and learns.
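As a rough sketch of how one ERL rollout might be structured (the helpers `generate_attempt`, `generate_reflection`, `verify`, and `reinforce` are hypothetical stand-ins, not the paper's implementation), the loop looks something like this:

```python
# Rough sketch of one ERL rollout. The callables are hypothetical stand-ins
# for the model, the environment check, and the policy update.

def erl_rollout(task, generate_attempt, generate_reflection, verify, reinforce):
    # 1. Experience: a first attempt and the environment's sparse feedback.
    first_attempt = generate_attempt(task)
    feedback = verify(task, first_attempt)

    # 2. Reflection: the model explains the outcome in natural language.
    reflection = generate_reflection(task, first_attempt, feedback)

    # 3. Consolidation: a second attempt conditioned on the reflection.
    second_attempt = generate_attempt(task, context=reflection)
    reward = verify(task, second_attempt)

    # The refined attempt's reward is what gets reinforced, so the policy
    # learns to write reflections that reliably lead to better attempts.
    reinforce(task, first_attempt, reflection, second_attempt, reward)
    return reward
```

The structural difference from the scalar loop above is that the reward being reinforced comes from an attempt conditioned on the model's own explanation, not from a blind retry.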
The three-act loop: experience, reflection, consolidation
The ERL process unfolds in three structured stages. First, given a task, the language model generates an initial attempt. This might be a solution to a coding problem, a reasoning chain for a multi-step question, or a movement sequence in a control environment, depending on what the task demands. The environment evaluates this attempt and provides feedback.
Crucially, this feedback goes to the model not as a scalar reward signal to be processed in isolation, but as context for generating a reflection. The model is prompted to produce a natural language explanation of what happened: what was the mistake, why did it occur, what would be different next time? This is the critical step. The reflection transforms abstract feedback into semantic structure that the model can use.
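The paper's exact prompt isn't reproduced here, but a reflection prompt in this spirit might look like the template below; the wording and field names are illustrative assumptions.

```python
# Illustrative reflection prompt template. The wording is an assumption made
# for the sake of example, not the prompt used in the paper.
REFLECTION_PROMPT = """\
Task:
{task}

Your previous attempt:
{attempt}

Environment feedback:
{feedback}

Reflect on this outcome in plain language:
1. What was the mistake (or what worked)?
2. Why did it happen?
3. What will you do differently on the next attempt?
"""

def format_reflection_prompt(task: str, attempt: str, feedback: str) -> str:
    return REFLECTION_PROMPT.format(task=task, attempt=attempt, feedback=feedback)
```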
Overview of Experiential Reinforcement Learning (ERL). Given an input task, the language model first produces an initial attempt and receives environment feedback. The same model then generates a self-reflection conditioned on the feedback, before producing a refined second attempt whose success is reinforced and internalized into the base policy.
Now comes the consolidated attempt. The model generates a second attempt, but this time it's conditioned on its own reflection about the first attempt. This second attempt, informed by structured reasoning, is much more likely to succeed. When it does, the RL signal reinforces this entire trajectory, not just the successful outcome. The model learns to generate useful reflections because doing so consistently leads to successful attempts.
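One plausible way to wire up that credit assignment, offered here as an assumption rather than the authors' exact recipe, is to share the refined attempt's reward with the reflection that produced it, while keeping the first attempt as context only:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str      # model-generated text for this part of the trajectory
    credit: float  # reward credit used by the policy update

def build_training_trajectory(first_attempt, reflection, second_attempt, reward):
    """Hypothetical credit assignment: the refined attempt's reward is shared
    with the reflection that produced it, so useful reflections are reinforced
    alongside correct answers."""
    return [
        Segment(first_attempt, 0.0),      # context for the reflection, not a target
        Segment(reflection, reward),      # credited whenever it leads to success
        Segment(second_attempt, reward),  # credited directly for the outcome
    ]
```

Under a scheme like this, a reflection earns credit only through the attempts it enables, which matches the incentive described above: generate reflections that make the second attempt succeed.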
This three-act structure is doing something subtle but profound. Standard RL asks: "Given the failure, adjust your policy." ERL asks: "Given the feedback, can you reason about it? Now generate a better attempt based on your reasoning." The second approach gives the model agency and structure. It's not passively receiving corrections; it's actively reasoning through problems and proving to itself that the reasoning works.
Proof across domains
The researchers evaluated ERL on three very different domains to show the approach generalizes. In sparse-reward control environments like FrozenLake and Sokoban, agents must navigate grids and solve puzzles with minimal reward signals. These are canonical testbeds for learning efficiency. In multi-step reasoning tasks like HotpotQA, models must chain together multiple reasoning steps to answer complex questions. In tool-using agentic reasoning, models must call external tools and interpret results, with sparse feedback on whether the final answer is correct.
Validation reward trajectories versus training wall-clock time on FrozenLake, HotpotQA, and Sokoban for two different model architectures. ERL consistently achieves higher reward and faster improvement than RLVR across all domains and both models.
The learning trajectories show a consistent pattern: ERL learns faster and reaches higher rewards than standard RL baselines. On FrozenLake and Sokoban, the improvements are substantial, reaching up to 81% gains in complex multi-step environments. On reasoning tasks like HotpotQA, improvements reach 11%. Importantly, the trajectories are plotted against wall-clock time, which matters in practice. ERL isn't just reaching higher performance; it's doing so faster, making it a genuine practical improvement.
Final evaluation reward on FrozenLake, HotpotQA, and Sokoban for both model architectures. ERL consistently outperforms RLVR across all three domains.
The consistency across domains tells us something important: we're not just fixing a specific problem. The reflection mechanism is addressing something fundamental about how models learn from sparse feedback. Whether the task involves spatial navigation, reasoning chains, or tool use, the explicit reasoning step provides value. Related work on agent learning via early experience and reexploration in embodied environments has explored how structured self-reflection can improve learning trajectories, but ERL demonstrates this works across dramatically different problem structures.
What actually drives the improvement
The results look strong, but which parts actually drive the improvement? Is it the reflection itself, or could it be something else? An ablation study isolates the mechanism by testing variants of ERL on FrozenLake.
The first variant disables cross-episode reflection reuse. The model can still generate reflections, but it doesn't carry that knowledge across training episodes. The second variant removes structured reflection entirely, replacing it with raw environmental feedback. The model still generates a second attempt, but without the intermediate reasoning step.
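Framed as configuration flags (an illustrative framing with assumed field names, not the authors' code), the three settings under comparison look roughly like this:

```python
from dataclasses import dataclass

# Illustrative framing of the ablation settings. Field names and the exact
# flag combinations are assumptions, not the authors' configuration.
@dataclass
class ERLConfig:
    use_reflection: bool         # structured self-reflection vs. raw feedback only
    reuse_across_episodes: bool  # carry reflections forward as cross-episode memory

FULL_ERL      = ERLConfig(use_reflection=True,  reuse_across_episodes=True)
NO_MEMORY     = ERLConfig(use_reflection=True,  reuse_across_episodes=False)
NO_REFLECTION = ERLConfig(use_reflection=False, reuse_across_episodes=False)
```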
Training reward trajectories comparing RLVR with ERL before and after reflection on FrozenLake. Post-reflection trajectories consistently achieve higher reward than both RLVR and pre-reflection attempts.
Both ablations hurt performance. The variant without memory learns more slowly than full ERL. The variant without structured reflection performs closer to standard RLVR baselines. Figure 7 confirms this pattern.
Ablation study on FrozenLake comparing full ERL with two variants: one without memory (no cross-episode reflection reuse) and one without reflection (replacing structured self-reflection with raw feedback). Full ERL outperforms both variants, confirming that both components are necessary.
These ablations confirm the core mechanism isn't just "try twice and you'll do better." The magic lives specifically in the reflection step and in the model's ability to build on previous reflections. The model has to be doing genuine reasoning work, and that reasoning has to compound across episodes. You can't phone this in.
Why this sticks at deployment
Here's a practical question: does the model need to keep reflecting once it's deployed in production? During training, the model goes through the full three-act loop: attempt, feedback, reflection, refined attempt. But at deployment, you don't need to provide environmental feedback or ask for reflection. You just need the model to generate its attempt. The reflection was part of the training process, not something that has to happen at inference time.
This is where ERL gets clever. Many RL approaches require additional inference steps or external tools at deployment that slow things down or add complexity. ERL's gains come entirely from a better training signal. Once trained, the model is just as efficient to run as the baseline.
The model isn't learning to reflect at deployment. It's learning better policies during training because reflection forced it to reason through problems. The refined second attempts, informed by reflection, contain implicit knowledge about reasoning and error correction. That knowledge gets absorbed into the base policy weights. At deployment, the model has simply learned better heuristics for getting things right the first time.
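Put in code terms, only the training loop sketched earlier runs the full three stages; at inference the trained model is called once, with no feedback, reflection, or second attempt. Here `llm` is a hypothetical callable wrapping the trained model.

```python
# Deployment sketch: no feedback, no reflection, no second attempt.

def deploy(llm, task):
    # One forward pass, at the same inference cost as the baseline model.
    # The lessons learned through reflection now live in the policy weights.
    return llm(f"Task: {task}\nSolve the task.")
```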
This is the bridge between research and practice. Many elegant training procedures fall apart when you try to deploy them because they require overhead that's not acceptable in production. ERL sidesteps this entirely. The training cost is worth it because you get a better final model at no additional deployment cost.
From feedback to durable learning
The core contribution of this work is a reconceptualization of what reinforcement learning feedback actually accomplishes. Instead of treating feedback as a direct correction signal, ERL treats it as a trigger for reasoning. The model explains what happened, uses that explanation to guide a refined attempt, and learns from the success of that attempt.
This transforms sparse, delayed feedback into structured behavioral revision. The results suggest something optimistic about how we can make RL more efficient for language models: by giving models the space and structure to reason about their own failures, we can dramatically improve learning efficiency and final performance. It's not about smarter feedback or more data. It's about asking the model to do what humans do naturally: pause, reflect, and try again with clearer thinking.
This is a Plain English Papers summary of a research paper called Experiential Reinforcement Learning. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.
