F-GRPO Fixes RL’s “Rare Solution Amnesia” Without Bigger Batches

Written by aimodels44 | Published 2026/02/11
Tech Story Tags: ai | f-grpo | rare-solution-amnesia | grpo | rlvr | reinforcement-learning | code-generation | math-reasoning

TL;DR: F-GRPO shows why GRPO-style RL forgets rare correct solutions (the tail-miss probability peaks at intermediate batch sizes) and fixes it with focal-style advantage scaling.

This is a Plain English Papers summary of a research paper called F-GRPO: Don't Let Your Policy Learn the Obvious and Forget the Rare. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.

The sampling trap: why focusing on easy cases hurts

Imagine you're training a language model to solve coding problems. You collect a batch of 32 attempts at some problem, grade them as correct or incorrect, and use the correct ones to adjust the policy. This seems reasonable: learn from successes. But there's a hidden cost. If that problem has five possible correct solutions, and your batch only happens to contain two of them, the algorithm updates the policy to favor those two. The other three solutions become less likely, even though they're equally correct.

Over many training steps, this compounds. The policy concentrates on solutions it's already good at and forgets solutions it has never learned. This is the fundamental tension at the heart of modern reinforcement learning for code and math problems: the natural learning signal pushes toward common solutions, leaving rare ones behind.

This problem isn't theoretical. Reinforcement learning with verifiable rewards (RLVR), the family of algorithms that includes GRPO, DAPO, and CISPO, works by sampling groups of rollouts and updating the policy based on which ones succeed. In practice, computational limits force small batch sizes, usually 32 to 64. Large batches would be ideal, but they're not feasible. So practitioners are stuck with a constraint they understand intuitively but can't quantify precisely: small batches have blind spots.

Understanding what those blind spots look like mathematically is where this paper begins.

The hidden geometry of batch learning

Before fixing the problem, you need to know its precise shape. Does it get worse or better as batches grow? Is there a sweet spot? The researchers derive a closed-form probability: given a batch size N, what's the chance that a training update actually fires (contains mixed rewards and changes the policy) yet simultaneously misses at least one rare-correct solution?

Call this the tail-miss probability. The surprising finding is that it doesn't decrease monotonically as batch size increases. Instead, it peaks at intermediate values.

Probability that a training update is active yet misses rare-correct solutions, as a function of group size. The relationship is non-monotonic: small groups rarely produce a learning signal, large groups usually capture rare solutions, and the probability peaks at intermediate group sizes, where the learning signal is strong but the blind spots are widest.

Here's why: at very small batch sizes (say, 4), groups rarely contain both correct and incorrect solutions, so the algorithm rarely gets an active learning signal. At very large batch sizes (1000+), batches contain almost everything, including rare solutions. But at intermediate sizes, something unfortunate happens. The batch is large enough to trigger learning (it contains mixed rewards) but small enough to have systematic blind spots (it misses rare solutions entirely).

This non-monotonicity is crucial. It tells us that the problem isn't solved by simply using bigger batches. The fundamental issue runs deeper.
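
To make the shape concrete, here is a minimal sketch of that calculation under a toy model that is an assumption of this summary, not necessarily the paper's exact formula: each rollout is independently rare-correct with probability q, common-correct with probability p - q, and incorrect with probability 1 - p, and an update is "active" when the group has mixed rewards.

```python
import numpy as np

def tail_miss(N, p, q):
    """P(update is active AND no rare-correct rollout is sampled).

    Toy i.i.d. model (an assumption of this sketch, not the paper's exact
    derivation): each rollout is rare-correct w.p. q, common-correct w.p.
    p - q, and incorrect w.p. 1 - p.
    """
    return (1 - q) ** N - (p - q) ** N - (1 - p) ** N

p, q = 0.4, 0.02              # assumed base success rate and rarity
Ns = np.arange(2, 513)        # group sizes to sweep
vals = tail_miss(Ns, p, q)
print("peak at N =", Ns[vals.argmax()], "with tail-miss ~", round(float(vals.max()), 3))
```

Sweeping N with these assumed parameters reproduces the qualitative picture in the figures: the probability climbs, peaks at a modest group size, and only then slowly decays.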

Tail-miss probability versus group size across different base success rates. Each panel peaks at intermediate group sizes, showing that the effect is robust and not an artifact of specific parameter choices.

When more data makes things worse

Here's where the puzzle gets darker. It's not just that you might fail to sample a rare solution. It's that even when you do sample one, the algorithm might actually reduce its probability in future steps. How is that possible? You just told the algorithm "this is correct." Why would it learn to avoid it?

The answer lies in how group-relative algorithms work. They don't simply ask "is this correct?" They ask "is this correct relative to the other things in this batch?"

Suppose a batch contains three kinds of solutions: A (correct, very common), B (correct, rare), and C (incorrect). The algorithm pushes A and B up and C down. But because A is common, it shows up many times in the batch and collects the positive update again and again, while B collects it only once; and since probability mass is normalized, every boost to A pulls down everything else, including B. Over time, the policy becomes increasingly concentrated on A.
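
A small worked example, with hypothetical counts, makes the asymmetry concrete: in a group of eight rollouts containing five copies of A, one copy of B, and two incorrect attempts, every correct rollout gets the same standardized advantage, but A collects it five times over.

```python
import numpy as np

# Hypothetical group of 8 rollouts for one prompt:
# five copies of common solution A, one copy of rare solution B, two incorrect C.
rewards = np.array([1, 1, 1, 1, 1, 1, 0, 0], dtype=float)   # indices 0-4: A, 5: B, 6-7: C
adv = (rewards - rewards.mean()) / rewards.std()             # GRPO-style group advantage

# Every correct rollout gets the same per-sample advantage (~ +0.58),
# but A appears 5 times, so it collects ~5x the total positive gradient weight.
print("per-sample advantage (correct):", round(adv[0], 2))
print("total positive weight on A:", round(5 * adv[0], 2), "vs on B:", round(adv[5], 2))
```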

This creates a ratchet effect. Future batches are more likely to contain A and less likely to contain B, because the policy now puts more probability on A. Eventually, B might disappear from batches entirely. When that happens, the algorithm never sees B again, and its probability in the policy can keep shrinking.

This is formalized in a precise claim: unsampled-correct mass can shrink even as total correct mass grows. In other words, the total probability assigned to correct solutions can increase while the probability of solutions you haven't seen recently actually decreases. The model is getting better overall but worse at rare cases.

Categorical policy simulation of total correct mass and retained positive mass over training. Total correct mass (panel a) increases while retained positive mass (panel b) declines: the model improves overall while forgetting specific correct solutions it hasn't seen recently.
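
A minimal simulation in the same spirit reproduces the qualitative effect. The policy size, initialization, learning rate, and number of steps below are assumptions of this sketch, not the paper's setup; with these values the rare solutions' mass typically shrinks while total correct mass grows.

```python
import numpy as np

rng = np.random.default_rng(0)

# Categorical "policy" over K candidate solutions, parameterized by softmax logits.
# Indices 0-2 are correct; index 0 is a common solution, indices 1-2 are rare ones.
K = 20
correct = np.zeros(K, dtype=bool)
correct[:3] = True
logits = np.zeros(K)
logits[0] = 2.0                 # common correct solution starts likely
logits[1:3] = -2.0              # rare correct solutions start unlikely

N, LR, STEPS = 16, 0.5, 300     # group size, learning rate, updates (assumed values)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

start = softmax(logits)
for _ in range(STEPS):
    p = softmax(logits)
    batch = rng.choice(K, size=N, p=p)           # sample a group of rollouts
    r = correct[batch].astype(float)             # verifiable 0/1 rewards
    if r.std() == 0:
        continue                                 # all-same rewards: update inactive
    adv = (r - r.mean()) / (r.std() + 1e-8)      # group-relative advantage
    for a, A in zip(batch, adv):                 # REINFORCE-style logit update
        grad = -p.copy()
        grad[a] += 1.0                           # d log p[a] / d logits
        logits += (LR / N) * A * grad

end = softmax(logits)
print("total correct mass:", start[correct].sum(), "->", end[correct].sum())
print("rare correct mass: ", start[1:3].sum(),     "->", end[1:3].sum())
```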

The mathematical story

Up to this point, the narrative has been intuitive. Now comes the rigor. The authors model a simplified setting inspired by recent work on policy optimization with rewards, with parameters for base success probability, rarity of specific solutions, and batch size. From these three inputs, they derive an exact formula for tail-miss probability.

Tail-miss probability at realistic parameter values, demonstrating that the effect remains substantial at the batch sizes commonly used in language model training.

This formula transforms the problem from vague intuition to mathematical precision. You can now ask: at what batch size is the miss-probability highest? How much does this bias change with the base success rate? What parameters drive the effect most strongly?

The answer reveals why batch size alone won't fix this: the peak's location shifts with the other parameters, so a batch size that sits safely past the peak for one problem can sit right on top of it for another.
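
Under the same toy model sketched earlier (again an assumption of this summary, not the paper's exact formula), a quick sweep shows the peak location moving with the base success rate:

```python
import numpy as np

# Toy closed form: P(active update AND no rare-correct rollout sampled).
def tail_miss(N, p, q):
    return (1 - q) ** N - (p - q) ** N - (1 - p) ** N

Ns = np.arange(2, 513)
for p in (0.2, 0.4, 0.6, 0.8):                   # assumed base success rates
    vals = tail_miss(Ns, p, q=0.02)
    print(f"p={p:.1f}: tail-miss peaks at N={Ns[vals.argmax()]}")
```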

A targeted fix inspired by computer vision

If batch size doesn't smoothly solve the problem, you need a different lever. The insight is this: you're not learning too much from common solutions. You're learning the right amount from them, but at the expense of rare ones. So instead of changing batch size, change how much each solution contributes to the gradient.

The authors propose a focal-loss-inspired approach. In computer vision, focal loss handles imbalanced datasets by downweighting easy examples and upweighting hard ones. The same principle applies here.

The modification is a difficulty-aware advantage scaling coefficient that multiplies the advantage function. High-probability solutions (which the policy already handles well) get downweighted. Low-probability solutions get preserved or even amplified. This does two things simultaneously: it reduces the gradient signal from easy-to-solve cases, and it protects the gradient signal from hard-to-solve cases.

Scaled advantage magnitude versus success probability for correct (solid) and incorrect (dashed) rollouts. Correct rollouts receive reduced weight when they are already likely and keep full weight when they are rare; the downweighting is asymmetric and targets the bias directly.

What makes this elegant is simplicity. It's a single function multiplied at training time. No new hyperparameters beyond those already used in focal loss, which have proven robust in other domains. No change to batch size. No additional computational cost. It slots into any group-relative algorithm: GRPO, DAPO, CISPO. The modification is literally just one line of code in the advantage computation.
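
As a rough illustration, here is what a focal-style scaling could look like on top of a GRPO-style advantage computation. Using the group's empirical success rate as the difficulty proxy and gamma = 2.0 are assumptions of this sketch; the paper's exact coefficient may differ.

```python
import numpy as np

def focal_scaled_advantages(rewards: np.ndarray, gamma: float = 2.0) -> np.ndarray:
    """Group-relative advantages with a focal-style, difficulty-aware scale.

    rewards: 0/1 verifiable rewards for one prompt's group of rollouts.
    NOTE: the proxy (group success rate) and gamma are assumptions of this sketch.
    """
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # standard GRPO advantage
    p_hat = rewards.mean()                                      # empirical success rate
    scale = np.where(rewards > 0,
                     (1.0 - p_hat) ** gamma,   # shrink positive signal on easy prompts
                     p_hat ** gamma)           # shrink negative signal on hard prompts
    return scale * adv

# Example: an easy prompt (7/8 correct) vs. a hard one (1/8 correct).
easy = np.array([1, 1, 1, 1, 1, 1, 1, 0], dtype=float)
hard = np.array([1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
print(focal_scaled_advantages(easy).round(3))   # correct rollouts heavily downweighted
print(focal_scaled_advantages(hard).round(3))   # the rare success keeps most of its weight
```

In this hedged form, the scaling is a single multiplicative line on top of the standard advantage, consistent with the paper's description of the change as a drop-in modification.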

Does it work in practice?

All the mathematics and intuition mean nothing without empirical validation. The team tested F-GRPO on Qwen2.5-7B, a 7-billion-parameter language model, across multiple in-domain and out-of-domain benchmarks.

The results are consistent and substantial:


  • GRPO: 64.1% baseline → 70.3% with focal scaling (6.2 percentage point improvement)
  • DAPO: 69.3% baseline → 72.5% with focal scaling (3.2 percentage point improvement)
  • CISPO: 73.2% baseline → 76.8% with focal scaling (3.6 percentage point improvement)


Critically, pass@1 (single-attempt success) is preserved or improved, meaning the gains in coverage do not come at the expense of single-attempt accuracy. There is no increase in batch size or computational cost. The improvements hold across multiple algorithms and multiple benchmarks, suggesting the fix targets a fundamental issue in group-relative RL rather than an algorithm-specific quirk.

This work provides a concrete case study in how to take an intuitive but imprecise problem, quantify it rigorously, and leverage that quantification to design a targeted solution. The rare-solution problem was understood empirically by practitioners. What the paper provides is the mathematical framework to reason about it precisely and the algorithmic tool to address it directly.

For anyone training language models on code or mathematical reasoning, this is immediately actionable. For researchers studying RL more broadly, it's a model of how to move from "we have this phenomenon" to "here's exactly how it works" to "here's a lightweight fix."

