The next great leap in artificial intelligence is the creation of “agentic LLMs”: AI that can perform complex, open-ended tasks without direct supervision. Imagine an AI that could develop entire software modules by writing, testing, and debugging its own code, or a research assistant that could formulate hypotheses, design experiments, and adapt its strategy based on the results. This is the future that researchers are working towards. However, the methods we use to train AI today are holding us back. Most current training focuses on static datasets and simple, single-turn tasks, like answering a single math question. This approach is fundamentally unsuited to teaching the long-horizon planning, trial-and-error learning, and iterative refinement needed for true agentic behavior. To bridge this gap, researchers have developed a new open-source framework called GEM (General Experience Maker). Think of GEM as being for AI agents what OpenAI Gym was for traditional reinforcement learning: a standardized “gym” where agents can learn by interacting with complex, multi-turn environments. By putting AI agents through their paces in this new framework, researchers are already uncovering some surprising and important lessons about how to build truly capable AI.

The Future of AI Training is ‘Experience,’ Not Just Data

1. Forget Bigger Datasets, The Real Breakthrough is Experience-Based Learning

Common wisdom suggests that better AI comes from bigger datasets. The GEM experiments, however, reveal that the next frontier isn’t just about feeding models more static text files. Instead, the paradigm is shifting toward “experience-based learning,” where AI agents acquire skills by actively interacting with complex, simulated environments. This represents a fundamental change in how we think about AI training and is the foundational discovery that enables all the others. Rather than memorizing patterns from a fixed dataset, agents learn through dynamic processes like trial-and-error and long-horizon planning. This interactive learning is essential for developing the sophisticated capabilities required for the complex agentic tasks of the future, from coding entire applications to driving scientific discovery.

A Simpler, Old-School Algorithm is Making a Powerful Comeback

2. A Simple Algorithm is Outperforming More Complicated Ones

Once you have a world for an agent to experience, how do you actually teach it? In a surprising twist, the researchers found that the key to unlocking new capabilities wasn’t a shiny new algorithm, but a battle-tested veteran from the early days of AI. Many modern reinforcement learning algorithms, such as GRPO, excel in simple, single-turn settings. However, they are fundamentally ill-suited to the full complexity of multi-turn problems, where agents receive feedback after each step. Workarounds exist, but they force compromises that prevent agents from learning crucial skills. The real breakthrough came from a simple variant of REINFORCE, a foundational “on-policy” method from 1992 (Williams, 1992) that learns only from the actions the agent itself takes, combined with a technique called Return Batch Normalization (ReBN). The study found this straightforward approach consistently outperforms or is comparable to more complex methods like PPO and GRPO across all tested environments. As the paper puts it: “…ReBN outperforms or is comparable with PPO and GRPO in all evaluated environments, rendering it the strongest baseline without expensive computations, such as critic learning or extensive rollouts.”
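To make the idea concrete, here is a minimal sketch of the REINFORCE-plus-ReBN recipe. The function names are mine, and the exact normalization (per-step discounted returns standardized across the whole batch, used as advantages in place of a learned critic) is my reading of the technique as the article describes it, not the authors’ reference implementation:

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Per-step discounted return G_t = r_t + gamma * r_{t+1} + ... for one episode."""
    returns = np.zeros(len(rewards))
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

def rebn_advantages(batch_rewards, gamma):
    """Return Batch Normalization (ReBN), as described in the article:
    per-step returns are standardized across the whole batch and used as
    advantages, so no value network (critic) has to be trained."""
    returns = [discounted_returns(r, gamma) for r in batch_rewards]
    flat = np.concatenate(returns)
    mean, std = flat.mean(), flat.std() + 1e-8
    return [(g - mean) / std for g in returns]

# Two toy multi-turn episodes: a sparse reward of 1.0 arrives on the final turn.
batch_rewards = [
    [0.0, 0.0, 1.0],        # solved in 3 turns
    [0.0, 0.0, 0.0, 1.0],   # solved in 4 turns
]

# With a real model, each advantage would weight the log-probability of the
# corresponding turn in the REINFORCE loss: loss = -sum(adv_t * log_pi(a_t | s_t)).
for adv in rebn_advantages(batch_rewards, gamma=0.9):
    print(np.round(adv, 3))
```

Because the batch-normalized returns stand in for the advantage, nothing beyond the policy itself has to be trained, which is exactly the “without expensive computations, such as critic learning” property the quote highlights.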
To Make an AI Efficient, You Have to Teach It About Time

3. An AI’s Sense of Urgency is a Trainable Skill

But it wasn’t just the choice of algorithm that mattered. The researchers found that even a single, often-overlooked parameter could dramatically change an agent’s behavior, revealing why the simpler algorithm excelled. This parameter, the “discount factor” (γ), can be thought of as a rule of the environment’s physics that controls an agent’s sense of urgency. Think of γ as an interest rate on future rewards. A high gamma (γ ≈ 1) is like a low interest rate: the agent doesn’t mind waiting for a payout. A low gamma (γ = 0.9) is like a high interest rate: rewards today are worth much more than rewards tomorrow, incentivizing the agent to finish tasks quickly. The paper provides a clear example using a GuessTheNumber game, where the optimal strategy is a binary search. When γ was high (0.999), the agent was patient but inefficient. When γ was lowered to 0.9, the agent was incentivized to solve the problem in fewer steps and successfully learned the optimal binary-search strategy. This finding is crucial because it connects directly to the previous takeaway: the major compromise required to make complex algorithms like GRPO work in multi-turn settings is that they effectively fix the discount factor at γ = 1, making them unable to learn time-sensitive efficiency. Subtle algorithmic details that are lost in oversimplified training setups turn out to be essential for teaching agents complex behaviors.
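A quick back-of-the-envelope calculation shows how strongly γ shapes that sense of urgency. The terminal reward of 1.0 and the guess counts below are illustrative assumptions, not numbers taken from the paper:

```python
# A terminal reward of 1.0 arrives on the final guess; seen from the first
# turn it is worth gamma ** (num_guesses - 1).
for gamma in (0.999, 0.9):
    slow = gamma ** (10 - 1)  # a plodding, one-number-at-a-time strategy (10 guesses)
    fast = gamma ** (4 - 1)   # a binary-search-style strategy (4 guesses)
    print(f"gamma={gamma}: 10 guesses -> {slow:.3f}, 4 guesses -> {fast:.3f}")
```

With γ = 0.999 the two strategies are worth almost the same (≈0.991 vs ≈0.997), so the agent feels little pressure to be efficient; with γ = 0.9 the four-guess strategy is worth nearly twice as much (≈0.729 vs ≈0.387), which is exactly the incentive that pushed the agent toward binary search.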
To Build a Smarter Agent, Give It a Smarter Opponent

4. The Quality of an AI’s ‘Sparring Partner’ is More Important Than Its Own Power

This principle, that the rules and richness of the environment are paramount, led to perhaps the most counter-intuitive discovery of all. In a multi-agent experiment, researchers paired an “assistant agent” with a “user agent” that simulated customer behavior, testing both roles at different model strengths: weak, medium, and strong. The results turned conventional wisdom on its head. A strong user simulator paired with a weak assistant agent achieved a 44.3% success rate, while a strong assistant agent paired with a weak user simulator succeeded only 41.7% of the time. This finding ties everything together: the quality and capability of the training environment, the AI’s “sparring partner,” can be even more critical to success than the raw power of the agent being trained. To build a truly capable AI, you must first build a challenging and intelligent world for it to gain experience in.

Conclusion: A New Era of Learning

The core lesson from GEM is that intelligence isn’t just about raw computational power; it’s an emergent property of an agent’s interaction with a rich, challenging, and responsive world. Building truly intelligent AI agents requires a fundamental shift in our approach. We must move beyond simple, single-step tasks and embrace the complexities of multi-turn, interactive, experience-based learning. Frameworks like GEM are essential tools in this journey, providing the standardized environments needed to uncover the subtle but powerful principles that govern agentic learning. They allow us to discover that simpler algorithms can be superior, that an AI’s sense of efficiency is a trainable skill, and that the quality of an agent’s environment can matter more than its own innate power.

Podcast episode:
Apple Podcasts: https://lnkd.in/eqipuuBy
Spotify: https://lnkd.in/eMDjNMKB