This is a Plain English Papers summary of a research paper called GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization.
The hidden problem in multi-reward training
Language models are increasingly asked to satisfy multiple objectives simultaneously. A coding assistant needs to be accurate, concise, and properly formatted. A math tutor needs to explain reasoning clearly while keeping responses digestible. A safety-aligned model needs to be helpful, harmless, and honest. These aren't competing goals in some abstract sense; they're concrete constraints that determine whether a deployed system actually works.
The natural response is to give a model multiple reward signals during training, each one capturing a distinct preference. But here's the problem: nobody actually tested whether the dominant optimization method used for single-reward reinforcement learning scales appropriately to this multi-reward setting. The field simply assumed that if a technique like GRPO (Group Relative Policy Optimization) worked well for one reward, it would handle many rewards just as effectively.
This paper challenges that assumption by demonstrating that GRPO, when applied directly to multiple rewards, suffers from a fundamental information-loss problem. The fix is elegantly simple: normalize each reward separately instead of normalizing their combination. The resulting method, GDPO (Group reward-Decoupled Normalization Policy Optimization), consistently outperforms GRPO across tool calling, math reasoning, and coding tasks.
Why normalization matters, and how it breaks
To understand the problem, it helps to understand what normalization does. In reinforcement learning, a model learns by comparing how good each output is relative to some baseline. The technical term is "advantage": a measure of how much better or worse an output was than average. Normalization rescales these values, typically to zero mean and unit variance within a group of outputs sampled for the same prompt, so the training signal has a consistent scale and good and bad outputs stay clearly distinguishable.
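For reference, the group-normalized advantage GRPO computes in the single-reward case looks like this (standard formulation; the notation here is ours):

$$
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}
$$

where $r_i$ is the reward of the $i$-th output in a group of $G$ outputs sampled for the same prompt.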
GRPO normalizes advantages at the group level, which works beautifully when you have a single reward signal. But when you have multiple distinct rewards, something unexpected happens. A single model output produces multiple reward values, one for each objective. GRPO combines these rewards into a single advantage calculation before normalizing. This seems reasonable until you realize what it means: different combinations of rewards can end up with identical advantage scores even though they represent fundamentally different trade-offs.
To make this concrete, imagine two scenarios. In the first, a model output is correct but poorly formatted. In the second, the output is formatted correctly but incorrect. With multiple rewards, these should produce clearly different training signals. But GRPO's group-level normalization can collapse both scenarios into the same advantage value, destroying the information that distinguishes them. The model receives no clear feedback about which dimension actually improved.
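Here is a minimal numeric sketch of that collapse. The reward values and group composition are made up for illustration:

```python
import numpy as np

# Four rollouts for the same prompt, each scored on two binary rewards:
# columns are [correctness, format adherence].
rewards = np.array([
    [1.0, 0.0],   # correct but badly formatted
    [0.0, 1.0],   # well formatted but wrong
    [0.0, 0.0],   # neither
    [1.0, 1.0],   # both
])

# GRPO-style: combine the rewards first, then normalize across the group.
combined = rewards.sum(axis=1)
advantages = (combined - combined.mean()) / (combined.std() + 1e-8)

print(advantages)
# The first two rollouts receive the *same* advantage, even though one got
# correctness right and the other got formatting right -- that distinction
# is gone by the time the gradient is computed.
```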
GRPO maps different reward combinations into only two distinct advantage groups, whereas GDPO normalizes each reward independently, preserving more nuanced distinctions.
This problem gets worse as the system grows more complex. With N model outputs and M reward signals, the number of theoretically possible distinct advantage combinations grows exponentially, on the order of 2^(N·M). But GRPO's group-level normalization crushes this down to just 2N distinct groups. This represents a staggering loss of information resolution.
Advantage groups preserved by different methods. As the number of rollouts or rewards increases, GDPO preserves exponentially more distinct advantage groups than GRPO, preventing information collapse.
The problem doesn't disappear when you add more training data. In fact, it compounds. The more rollouts a system generates, the more information GRPO throws away. For someone training a large language model with multiple objectives, this is a critical failure mode hiding in plain sight.
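To get a feel for how fast that gap opens up, here is a small counting sketch: it enumerates every way of assigning binary rewards to a tiny group of rollouts and counts how many distinct advantage patterns each scheme can produce. The equal weighting and the small group size are our simplifications; the point is the trend, not the paper's exact numbers.

```python
import itertools
import numpy as np

def grpo_advantages(rewards):
    # Combine rewards per rollout first, then normalize across the group.
    total = rewards.sum(axis=1)
    return (total - total.mean()) / (total.std() + 1e-8)

def gdpo_advantages(rewards):
    # Normalize each reward column independently, then sum (equal weights).
    normed = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    return normed.sum(axis=1)

N, M = 4, 2  # rollouts per group, number of binary rewards
grpo_patterns, gdpo_patterns = set(), set()
for flat in itertools.product([0.0, 1.0], repeat=N * M):
    r = np.array(flat).reshape(N, M)
    grpo_patterns.add(tuple(np.round(grpo_advantages(r), 6)))
    gdpo_patterns.add(tuple(np.round(gdpo_advantages(r), 6)))

print("distinct GRPO advantage patterns:", len(grpo_patterns))
print("distinct GDPO advantage patterns:", len(gdpo_patterns))
# GDPO preserves noticeably more distinct patterns, and the gap widens
# as the number of rollouts and rewards grows.
```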
How GDPO fixes the collapse
The solution is to change the order of operations. Instead of combining all rewards first and then normalizing, GDPO normalizes each reward independently and then combines them. It's a small change that has large effects.
Here's how it works: for each reward signal, compute the advantage by comparing each output against the baseline using only that reward's values. Then normalize by that reward's own mean and standard deviation. Finally, combine these individual advantages into a single training signal using a weighted sum. This ensures that each reward dimension remains distinguishable from the others, and the model receives clear feedback about which objectives are improving.
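Here is a minimal sketch of that recipe (our own simplified version; the actual implementation may differ in details such as clipping, epsilon handling, and how the weights are chosen):

```python
import numpy as np

def gdpo_advantages(rewards: np.ndarray, weights: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-reward-normalized advantages for one group of rollouts.

    rewards: (num_rollouts, num_rewards) raw reward values for one prompt.
    weights: (num_rewards,) relative importance of each objective.
    """
    # 1. Normalize each reward dimension with its own group statistics.
    mean = rewards.mean(axis=0, keepdims=True)
    std = rewards.std(axis=0, keepdims=True)
    per_reward_advantages = (rewards - mean) / (std + eps)

    # 2. Combine the per-reward advantages into one training signal.
    return per_reward_advantages @ weights

# Example: correctness weighted 1.0, format adherence weighted 0.5 (illustrative values).
rewards = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [0.0, 0.0],
                    [1.0, 1.0]])
print(gdpo_advantages(rewards, weights=np.array([1.0, 0.5])))
# With these weights, the first two rollouts now receive different advantages,
# unlike in the GRPO-style calculation shown earlier.
```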
This approach preserves the relative differences within each reward signal while preventing one dominant reward from drowning out others. The model can now learn genuine multi-objective behavior instead of defaulting to whichever objective happens to produce the strongest learning signal.
Testing the fix across real tasks
The tests span three distinct domains, each revealing different facets of why GDPO matters.
The first task is tool calling on Qwen2.5-1.5B. The model must simultaneously maximize correctness (does it call the right function with the right arguments?) and format adherence (is the output in the required format?). These objectives rarely conflict directly, but they do demand independent monitoring.
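As a rough illustration of what two such signals might look like, here is a sketch of a correctness reward and a format reward for JSON-style tool calls. The schema and scoring rules are our assumptions, not the paper's exact reward definitions:

```python
import json

def correctness_reward(output: str, expected_call: dict) -> float:
    """1.0 if the predicted call matches the expected function name and arguments."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(call, dict):
        return 0.0
    return float(call.get("name") == expected_call["name"]
                 and call.get("arguments") == expected_call["arguments"])

def format_reward(output: str) -> float:
    """1.0 if the output parses as a single JSON object with the expected keys."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return float(isinstance(call, dict) and "name" in call and "arguments" in call)
```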
Training curves on tool calling. GDPO consistently converges to higher correctness and format rewards across five runs, while GRPO shows less stable progression.
GDPO converges cleanly to higher performance on both dimensions. GRPO also improves, but the path is more wobbly and the final rewards are lower. Just as important, GDPO behaves consistently from run to run, which points to more stable training.
The second task involves a real tension: math reasoning on DeepSeek-R1-1.5B, where the system optimizes for correctness while controlling response length. Longer responses often contain more correct reasoning, but they also violate user preferences for conciseness. This creates an inherent conflict.
Training behavior with conflicting objectives. Both GRPO and GDPO initially maximize the length reward, but GDPO recovers to improve correctness while maintaining length control. GRPO gets trapped in suboptimal behavior.
Watch what happens: both methods initially exploit the length reward aggressively, temporarily suppressing correctness. But GDPO recovers and finds a better balance between the objectives. GRPO gets stuck. This is the core failure mode that GDPO solves. When rewards conflict, GRPO's collapsed advantage space leaves the model unable to navigate trade-offs effectively.
The third task, coding reasoning on the same model, confirms the pattern holds across domains. Testing on different model sizes (1.5B and 7B parameters) shows the improvement generalizes rather than being specific to one architecture.
When reward conflicts cause real trouble
The length reward problem deserves deeper investigation because it reveals why practitioners should care about this fix. If you give a model two competing signals, it will exploit whichever one is easier. In this case, simply producing longer output often correlates with higher scores, so the model learns to generate verbose responses regardless of whether they contain better reasoning.
When the length reward weight is high, GRPO-trained models sometimes violate length constraints by 10% or more while sacrificing accuracy. GDPO maintains both better length adherence and better accuracy simultaneously.
Accuracy and length violations under varying reward weights. As the length reward weight increases, GRPO accuracy drops sharply and length violations spike. GDPO maintains more stable performance across all weights.
The paper also tests a refinement: conditioning the length reward on whether the output is correct. Rather than rewarding length indiscriminately, this approach only rewards longer explanations when they lead to correct answers. This is a different kind of fix, addressing the reward design itself rather than the optimization method.
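One plausible way to express that conditioning (our sketch; the paper's exact reward definition may differ):

```python
def conditioned_length_reward(is_correct: bool, raw_length_reward: float) -> float:
    """Gate the length reward on correctness: incorrect answers get none.

    `raw_length_reward` is whatever base length score is in use; the gating
    on correctness is the refinement described above, while the base score
    itself is left abstract here.
    """
    return raw_length_reward if is_correct else 0.0
```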
Training with conditioned rewards. Even with better-designed conditioned rewards, GDPO maintains more stable improvement in both objectives.
GDPO remains superior even when the rewards themselves are improved. This demonstrates that the problem isn't just "your rewards are poorly designed," though that's part of it. The fundamental issue is how the optimization method handles the simultaneous existence of multiple distinct objectives.
Why this matters for future AI systems
As language models become more capable, user expectations expand. A model must not only be accurate but also helpful, efficient, safe, aligned with individual preferences, and respectful of format constraints. This isn't a problem that will go away; it will intensify.
Current approaches often treat this as a single aggregation problem: combine multiple objectives into one reward, optimize that, and hope the trade-offs resolve favorably. But the multi-reward training landscape is more nuanced. Different objectives have different scales, different variances, and different learning dynamics. Treating them as a unified signal causes information loss that compounds as complexity grows.
GDPO's contribution is to show that the order of operations in normalization matters profoundly. This insight invites similar scrutiny of other popular techniques. Do other RL methods have similar hidden limitations? What other normalization schemes might collapse important signal dimensions?
For practitioners, the fix is immediately deployable. GDPO requires minimal code changes compared to GRPO: normalize each reward independently using its own statistics, then combine. No architectural changes needed. This low barrier to adoption means the improvement can propagate quickly through the field.
For researchers, this work opens questions about the relationship between normalization schemes and multi-objective optimization. It demonstrates that assumptions that hold for single-objective learning can fail spectacularly in multi-objective settings. As AI systems become more ambitious and complex, understanding these scaling behaviors becomes essential.
The deeper insight is about information preservation. Any time you apply a lossy compression (like normalization) to high-dimensional data (like multiple reward signals), you risk destroying the very distinctions that matter for learning. GDPO's answer is not to avoid normalization, which serves a genuine purpose, but to apply it at the right granularity: independently for each objective, then combined downstream.
This reflects a general principle in machine learning: the structure of your data should match the structure of your algorithm. When you have multiple distinct reward signals, treating them as truly distinct during normalization aligns your optimization method with the reality of the problem you're solving.
If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.
