The Fix for Reasoning RL’s Data Problem: Recombine It

Written by aimodels44 | Published 2026/02/20
Tech Story Tags: ai | data-efficiency-crisis | verified-rewards | dataset-amplification | multi-thread-reasoning | structural-difficulty | curriculum-learning | stale-training-data

TL;DR: Composition-RL recombines verified prompts to create structurally harder training tasks, improving reasoning RL across model sizes and benchmarks.

The data efficiency crisis in reasoning

Training a language model to solve complex problems requires something more demanding than most datasets offer. You need verifiable step-by-step solutions, not just correct answers, so the model learns actual reasoning rather than pattern matching. These datasets are expensive to create, requiring human experts or careful algorithmic verification, which means they're perpetually scarce.

The reinforcement learning approach to improving reasoning depends entirely on these verified datasets. But an uncomfortable problem emerges as a model trains and improves: problems that initially challenged the model become trivial. The model solves them repeatedly, yet they remain in the training set, consuming compute and attention. You're essentially training on drills the model has already mastered, burning resources on stale feedback.

This is the fundamental tension the paper addresses. Verifiable prompts are essential infrastructure for reasoning improvement through RL, but they're filled with uninformative examples that waste training potential. The obvious fix, collecting more data, runs into the hard constraint that verification signals don't scale cheaply. So researchers face a different question: can you extract more value from the limited data you already have?

Why easy problems contain hidden value

Most approaches to this problem treat fully solved problems as waste. Once a model achieves a perfect pass rate on a problem, it gets discarded or deprioritized. The insight in this work inverts that logic: easy problems aren't dead weight; they're building blocks.

Consider how humans learn mathematics. Mastering single-digit addition doesn't become useless once you advance to multi-digit problems. Those foundational skills become components of more complex reasoning. A student combines addition with place value understanding to tackle harder arithmetic, then combines arithmetic knowledge with algebra, and so on. The original simple skill remains valuable because it can be recombined in new ways.

The same principle applies to language models. When a model solves Problem A correctly and Problem B correctly independently, it doesn't automatically solve "solve A and B together." That conjunction creates something genuinely new. The model must manage multiple reasoning threads simultaneously, combine intermediate results correctly, and avoid confusion between separate problem contexts. These are real challenges even if both underlying problems are individually simple.

This reframing changes everything. Rather than seeing fully solved problems as depleted resources, you see them as unused components waiting to be rearranged. If you have N verified problems, pairing them alone yields on the order of N² compositional variants, instantly expanding your effective training dataset without additional human annotation.
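To put a rough number on that expansion, here is a tiny Python sketch; the problem list is invented purely for illustration:

```python
from itertools import combinations

# A hypothetical pool of verified (question, answer) pairs.
verified = [
    ("Find the sum of the digits of 47.", "11"),
    ("Find the prime factorization of 24.", "2^3 * 3"),
    ("Evaluate 7 * 8.", "56"),
    ("How many primes are less than 10?", "4"),
]

# Unordered pairs alone give N*(N-1)/2 candidate compositions,
# before even counting triples or cross-domain mixes.
pairs = list(combinations(verified, 2))
print(f"{len(verified)} problems -> {len(pairs)} pairwise candidates")  # 4 -> 6
```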

Composing problems into new challenges

The core mechanism is straightforward to describe but powerful in practice. Take two independent math problems and merge them into a single problem that requires solving both, then combining the results.

The paper's opening illustration shows this concretely: take "find the sum of digits in 47" (answer: 11) and "find the prime factorization of 24" (answer: 2^3 × 3), then compose them into a single question like "find the product of the sum of digits in 47 and the number of distinct prime factors of 24." Now the model must solve both components correctly and combine them properly. The composition creates a structurally different task with higher cognitive demand.
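As a quick check of that worked example: 4 + 7 = 11, and 24 = 2^3 × 3 has two distinct prime factors, so the composed answer is 11 × 2 = 22. A few lines of Python confirm the arithmetic (pure illustration, not code from the paper):

```python
def digit_sum(n: int) -> int:
    return sum(int(d) for d in str(n))

def distinct_prime_factors(n: int) -> set:
    factors, p = set(), 2
    while p * p <= n:
        while n % p == 0:
            factors.add(p)
            n //= p
        p += 1
    if n > 1:
        factors.add(n)
    return factors

# Component answers: digit sum of 47 is 11; 24 = 2^3 * 3 has 2 distinct prime factors.
print(digit_sum(47) * len(distinct_prime_factors(24)))  # composed answer: 22
```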

Overview of Composition-RL. Top: an example of composing two math problems, illustrating the high-level idea of Composition-RL. Bottom left: pass@1 (%) on AIME24 versus training steps for different methods.

What makes this approach elegant is its deliberate simplicity. The composition process is fully mechanical: extract two problems, modify them slightly to ensure independence (changing numbers where they overlap), format them as a single question, and verify correctness using the original verifiable rewards. This simplicity matters because it scales to any domain and any model size without requiring new annotation infrastructure.
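A minimal sketch of that mechanical pipeline, assuming each verified item carries its own reward check; the prompt template and toy substring verifiers below are illustrative, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class VerifiedProblem:
    question: str
    check: Callable[[str], bool]  # the problem's original verifiable reward

def compose(a: VerifiedProblem, b: VerifiedProblem) -> VerifiedProblem:
    """Merge two independent problems into one prompt whose reward needs both parts."""
    question = (
        "Solve both sub-problems and report both answers.\n"
        f"(1) {a.question}\n"
        f"(2) {b.question}\n"
        "Format: Answer 1: <...>  Answer 2: <...>"
    )
    def check(response: str) -> bool:
        # Reuse the original verifiers; the reward fires only if both parts pass.
        return a.check(response) and b.check(response)
    return VerifiedProblem(question, check)

# Illustrative usage with toy substring checkers (far cruder than a real verifier).
p1 = VerifiedProblem("Find the sum of the digits of 47.", lambda r: "Answer 1: 11" in r)
p2 = VerifiedProblem("How many distinct prime factors does 24 have?", lambda r: "Answer 2: 2" in r)
composite = compose(p1, p2)
print(composite.check("Answer 1: 11  Answer 2: 2"))  # True
```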

Results across model scales

The empirical validation spans models from 4B to 30B parameters tested on the MATH dataset. The comparison is straightforward: RL trained on original problems versus RL trained on compositional variants.

The results confirm the intuition holds at scale. Figure 2 visualizes the core finding: the solve_all ratio, which measures what fraction of test problems the model solves correctly, climbs faster and higher when training on compositional prompts than on the original dataset. The improvement isn't marginal, and it's consistent across all model sizes tested.

Visualization of meta-experiments. Left: solve_all ratio curve for RL of Qwen3-4B-Base with original prompts (MATH12K) versus compositional prompts. Right: avg@8 accuracy on MATH500 and its corresponding compositional test prompts.

On harder benchmarks like AIME24, the advantage becomes even clearer. Models trained with compositional prompts show steady progress throughout training, while standard RL plateaus. The practical implication matters: composition achieved this improvement without new problems or additional human annotators, only by rearranging existing verified problems.
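For readers unfamiliar with the reported metrics, pass@1 and avg@8 are conventionally computed as below; these are the standard definitions, assumed rather than quoted from the paper:

```python
import numpy as np

def avg_at_k(correct: np.ndarray) -> float:
    """avg@k: mean correctness over k sampled answers per problem, averaged over problems.
    `correct` has shape (num_problems, k) with 0/1 entries."""
    return float(correct.mean())

def pass_at_1(correct_first_sample: np.ndarray) -> float:
    """pass@1: fraction of problems whose single sampled answer is correct."""
    return float(correct_first_sample.mean())

# Toy example: 4 problems, 8 samples each (0 = wrong, 1 = right).
samples = np.array([
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 1, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0],
])
print(avg_at_k(samples))          # avg@8 over the toy set
print(pass_at_1(samples[:, 0]))   # pass@1 using the first sample per problem
```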

Curriculum learning through composition

The results improve further with a curriculum approach: gradually increasing compositional depth during training. Rather than immediately exposing the model to triple and quadruple compositions, training starts with individual problems and pairs, then gradually increases to deeper compositions as the model develops.

This connects to foundational understanding of how models learn. Humans benefit from carefully sequenced difficulty, starting with fundamentals before tackling complexity. The same principle applies here. By allowing the model warm-up time with single problems before facing compositional challenges, it gradually builds compositional reasoning skills rather than facing a sudden shock of complexity.
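One way to express such a schedule is sketched below; the thresholds and depths are invented for illustration, and the paper's exact curriculum may differ:

```python
def composition_depth(step: int, total_steps: int) -> int:
    """Map training progress to how many problems are composed per prompt.
    Early training uses single problems; later phases add deeper compositions."""
    progress = step / total_steps
    if progress < 0.25:
        return 1   # warm-up: original problems only
    if progress < 0.50:
        return 2   # pairs
    if progress < 0.75:
        return 3   # triples
    return 4       # quadruples

# Example: composition depth at a few points in a 1000-step run.
for step in (0, 300, 600, 900):
    print(step, "->", composition_depth(step, 1000))
```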

Average@8 accuracy on MATH500 and corresponding compositional test prompts across different model sizes. Darker colors indicate larger improvements from Composition-RL over standard RL.

The curriculum variant yields steady improvements across all tested model sizes. This suggests composition taps into something fundamental about how models develop reasoning: they benefit from gradually increasing structural complexity, not just harder individual problems. The insight extends beyond composition itself, hinting at broader principles for structuring RL training.

Cross-domain composition

A practical scenario emerges naturally: what if you have separate verified datasets across different domains? Most organizations have a modest math problem collection, a separate set of coding challenges, and an independent repository of logic puzzles. Each domain individually constrains training, and because their formats and verification signals differ, you can't simply pool them into one dataset.

Composition-RL handles this elegantly. Compose problems within domains as before, but also compose across domains. A math problem combined with a coding problem creates something genuinely novel that requires hybrid reasoning. A model trained on mixed-domain compositional prompts develops broader reasoning capabilities than models trained on single-domain data.
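As an illustration of what a cross-domain composition and its combined reward might look like, here is a toy sketch; the prompt and checkers are invented, not taken from the paper:

```python
def check_math(math_answer: str, expected: str) -> bool:
    # Toy math verifier: exact match on the final answer string.
    return math_answer.strip() == expected

def check_code(candidate_code: str) -> bool:
    # Toy code verifier: execute the candidate and run a small unit test.
    # (A real system would do this inside a sandbox.)
    namespace = {}
    try:
        exec(candidate_code, namespace)  # expected to define is_even(n)
        return namespace["is_even"](10) and not namespace["is_even"](7)
    except Exception:
        return False

cross_domain_prompt = (
    "Part 1 (math): How many distinct prime factors does 24 have?\n"
    "Part 2 (code): Write a Python function is_even(n) that returns True for even n.\n"
)

def cross_domain_reward(math_answer: str, candidate_code: str) -> float:
    # Reward only when both the math part and the code part verify.
    return 1.0 if check_math(math_answer, "2") and check_code(candidate_code) else 0.0

print(cross_domain_reward("2", "def is_even(n):\n    return n % 2 == 0"))  # 1.0
```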

This cross-domain composition points to something subtle about how models learn reasoning. Exposure to compositional problems across domains doesn't just teach domain-specific skills, it teaches how to reason through multiple threads simultaneously, a meta-skill that transfers. Related work on compositional reasoning has shown that such structural skills generalize beyond their original contexts.

The practical advantage

The central insight here is elegant in its simplicity. Constraints breed innovation. Researchers couldn't afford unlimited verified data, so they asked what more could be extracted from existing data. The answer, composition, works because it doesn't just generate harder numbers; it generates structurally harder problems. A model that solves A and solves B doesn't automatically solve A-plus-B, and that gap is real training signal.

For practitioners with verification signals on existing problems, composition immediately expands available training data by orders of magnitude. If you have N verified problems, you can generate compositional variants that provide fresh training challenges without additional annotation work. The approach scales from small datasets to large ones, maintaining consistent improvements across model sizes.

The broader implication is more fundamental. The work demonstrates that problem structure and complexity are learnable dimensions independent of raw difficulty. This challenges the assumption that harder datasets necessarily require harder problems, suggesting instead that cleverly recombined existing problems can provide comparable or better training signal.

Natural follow-up questions emerge from this foundation: How deep can compositional chains go before diminishing returns? Do certain combinations of problems teach better than others? Can you automatically select which problems to compose for maximum learning gain? The simplicity of the core mechanism means the research direction has substantial room to develop.

This is a Plain English Papers summary of a research paper called Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models. If you like this kind of analysis, join AIModels.fyi or follow us on Twitter.


