Group rewards sound clean. Everyone works toward the same goal, everyone gets the same signal. The problem is that once agents start acting differently, that clean signal turns into total noise.
In cooperative reinforcement learning, agents quickly drift into different parts of the state space. One agent sees stable dynamics. Another gets chaos. But we still normalize the reward once, globally, as if they were living identical lives.
That assumption is a trap. It’s why your training runs are probably crashing at step 100k.
When you normalize group rewards globally, you mash together very different reward distributions. The gradients stop making sense. Credit assignment gets fuzzy. Learning either explodes or slowly dies.
You end up with the familiar mess: variance that explodes on one end and gradients that vanish on the other.
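To see what that mashing looks like, here's a toy NumPy sketch with made-up numbers, not data from any real run: one reward regime is calm, one is chaotic, and a single global z-score flattens the calm one while the chaotic one soaks up nearly all of the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two toy reward regimes (invented parameters): one calm, one chaotic.
calm    = rng.normal(loc=1.0, scale=0.1, size=1000)
chaotic = rng.normal(loc=5.0, scale=3.0, size=1000)

# Global normalization: one mean and one std for everything.
pooled = np.concatenate([calm, chaotic])
mu, sigma = pooled.mean(), pooled.std()

calm_z    = (calm - mu) / sigma
chaotic_z = (chaotic - mu) / sigma

print(calm_z.std())     # ~0.03: the calm regime is squashed to near-constant, almost no signal
print(chaotic_z.std())  # ~1.0:  the chaotic regime supplies essentially all of the variance
```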
The fix is almost boringly simple. Keep the group reward shared, but stop forcing everyone to interpret it the same way.
A Group Reward–Decoupled Normalization Policy does exactly that. The environment still emits one group reward. Nothing about the task changes. Coordination is still the objective.
But each agent normalizes that reward using its own running statistics.
Same reward in. Different scaling inside each agent.
That one separation matters more than most architectural tricks people reach for.
Each agent tracks its own mean and variance of the group reward. An agent operating in a volatile region adapts quickly. An agent in a stable region barely adjusts. Both get gradients that actually reflect their learning context.
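To put numbers on it, here's a tiny sketch. The running stats are invented (imagine one agent coming off a chaotic stretch and another off a calm one); the point is that the same group reward lands very differently once each agent scales it by its own statistics.

```python
import math

# The same shared reward arrives at two agents whose running stats differ
# (hypothetical values, not from any real run).
r_group = 3.0

mu_volatile, var_volatile = 2.0, 9.0    # agent in a high-variance region
mu_stable,   var_stable   = 2.0, 0.04   # agent in a stable region

print((r_group - mu_volatile) / math.sqrt(var_volatile))  # 0.33: a mild surprise
print((r_group - mu_stable)   / math.sqrt(var_stable))    # 5.0:  a big surprise
```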
This lines up directly with what the GDPO paper shows in a different setting. When you normalize combined rewards globally, distinct reward signals can collapse into the same advantage value. Once that happens, the policy loses useful feedback. Training looks busy, but it’s not really learning.
GDPO fixes this by decoupling normalization before aggregation. In multi-agent systems, the same logic applies. Variance does not come from the environment alone. It comes from interactions. Treating all agents as statistically identical is the mistake.
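I won't reproduce GDPO's exact formulation here, but the aggregate-then-normalize versus normalize-then-aggregate contrast is easy to see with a toy GRPO-style group advantage and two made-up reward signals:

```python
import numpy as np

def group_advantages(r):
    """GRPO-style advantage: z-score rewards within the sampled group."""
    return (r - r.mean()) / (r.std() + 1e-8)

# Four sampled responses, two reward signals (made-up numbers).
correctness = np.array([1.0, 0.0, 1.0, 0.0])   # passed by half the group
formatting  = np.array([0.0, 1.0, 1.0, 1.0])   # passed by most of the group

# Coupled: aggregate first, then normalize the combined reward.
coupled = group_advantages(correctness + formatting)
print(coupled)      # approx [-0.58 -0.58  1.73 -0.58]
# Responses 0 and 1 succeed at different things, yet get identical advantages.

# Decoupled: normalize each signal with its own statistics, then aggregate.
decoupled = group_advantages(correctness) + group_advantages(formatting)
print(decoupled)    # approx [-0.73 -0.42  1.58 -0.42]
# The two signals stay distinguishable in the final advantage.
```

Swap reward components for agents and the shape of the argument is the same: one shared scalar, separate statistics.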
From an engineering standpoint, this is refreshingly practical. No centralized critic. No reward shaping. No extra communication.
Just stop sharing normalization.
# Group Reward–Decoupled Normalization Policy
# One group reward, per-agent normalization
# α: EMA step size for the running stats, ε: small constant for numerical stability

initialize policy π_i and value V_i for each agent i
initialize per-agent stats: mean μ_i = 0, variance σ2_i = 1

for each training iteration:
    buffers B_i = empty
    for t in 1..T:
        # each agent acts from its own observation
        for each agent i:
            action a_i ~ π_i(o_i)
        # the environment returns a single shared group reward
        (next_obs, r_group, done) = env.step(a_1..a_N)
        # each agent normalizes that same reward with its own running stats
        for each agent i:
            μ_i  = (1 - α) * μ_i  + α * r_group
            σ2_i = (1 - α) * σ2_i + α * (r_group - μ_i)^2
            σ_i  = sqrt(σ2_i) + ε
            r_norm_i = (r_group - μ_i) / σ_i
            store (o_i, a_i, r_norm_i) in B_i
            o_i = next_obs[i]
        if done: break
    # independent per-agent updates on the per-agent normalized rewards
    for each agent i:
        compute advantages from B_i
        update π_i and V_i
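If you want the same bookkeeping in real code, here's a minimal self-contained Python sketch of the per-agent normalizer. The env, policies, and buffers are left out, and the class name plus the toy reward stream are mine, not from any particular codebase.

```python
import math
import random

class PerAgentGroupRewardNormalizer:
    """Per-agent running statistics over the shared group reward.

    Mirrors the pseudocode above: one instance per agent, each fed the same
    r_group every step, each keeping its own EMA mean and variance.
    """

    def __init__(self, alpha: float = 0.01, eps: float = 1e-8):
        self.alpha = alpha
        self.eps = eps
        self.mu = 0.0
        self.var = 1.0

    def normalize(self, r_group: float) -> float:
        # EMA update of mean and variance, then z-score with this agent's stats.
        self.mu = (1 - self.alpha) * self.mu + self.alpha * r_group
        self.var = (1 - self.alpha) * self.var + self.alpha * (r_group - self.mu) ** 2
        return (r_group - self.mu) / (math.sqrt(self.var) + self.eps)


# Hypothetical usage inside a rollout loop (env, policies, buffers omitted).
num_agents = 4
normalizers = [PerAgentGroupRewardNormalizer(alpha=0.01) for _ in range(num_agents)]

for step in range(5):
    r_group = random.gauss(1.0, 0.5)  # stand-in for the group reward from env.step(...)
    per_agent_rewards = [norm.normalize(r_group) for norm in normalizers]
    # per_agent_rewards[i] is what would go into agent i's buffer B_i.
    print(step, round(r_group, 3), [round(r, 3) for r in per_agent_rewards])
```

Each agent owns one of these; the only change to an existing training loop is which number gets stored in the buffer.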
