Two Training Paths, One Smarter AI Strategy

Written by aimodels44 | Published 2026/04/10
Tech Story Tags: ai | performance | self-distillation | on-policy-distillation | model-training | token-level-feedback | policy-gradients | training-instability

TL;DR: RLSD blends verifiable rewards with self-distillation to train models more stably and avoid the collapse seen in naive self-supervision.

The two paths to better models

Training large language models involves a fundamental choice between two different sources of feedback, each with its own strengths and weaknesses.

The first approach is on-policy distillation (OPD). Here, a larger teacher model watches over the student's shoulder and provides dense, token-level guidance for every decision. If the student generates a response, the teacher evaluates it step by step, signaling which tokens align with high-quality reasoning and which ones veer off track. This supervision is information-rich, theoretically efficient, and proven remarkably effective in practice: models trained this way converge faster and reach higher performance than many alternatives.

The second approach is reinforcement learning with verifiable rewards (RLVR). Instead of having a teacher narrate every step, the environment itself provides feedback: the response is correct or incorrect, the math problem is solved or unsolved, the reasoning chain is valid or flawed. This signal is sparse, often just a single bit of information per trajectory, but it's anchored to ground truth. The environment doesn't lie or simplify; it reports what actually happened.

The field leaned toward distillation because dense feedback is powerful. But an obvious question emerged: what if a model could supervise itself? If the student could access privileged information like the correct answer, it could generate both a raw response and a teacher-like critique, creating dense supervision without needing an external model. This led to on-policy self-distillation (OPSD), where the same model acts as both student and teacher. The approach sounded elegant. It had to work.

It didn't.

Why self-distillation alone creates instability

When researchers trained models using pure OPSD, performance climbed rapidly at first, then crashed. The models weren't learning robustly; they were learning to satisfy their own supervision signal in brittle ways that didn't generalize.

Consider what happens in the training loop. The model generates a response without access to the correct answer. Then, in a separate forward pass, the same model generates what it would have said if it had known the answer. The difference between these two versions becomes the supervision signal. Theoretically, this teaches the model to move toward correct answers. In practice, something else occurred.
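The loop above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `student_logprobs` and `teacher_logprobs` are hypothetical stand-ins for the per-token log-probabilities from the two forward passes, the second conditioned on the correct answer.

```python
import math

def opsd_token_signal(student_logprobs, teacher_logprobs):
    """Per-token OPSD signal: how much the answer-conditioned pass
    (teacher) prefers each generated token over the plain pass (student).
    Positive entries mark tokens the model would reinforce in itself."""
    return [t - s for s, t in zip(student_logprobs, teacher_logprobs)]

# Hypothetical log-probs for a 3-token response from the two passes.
student = [math.log(0.5), math.log(0.2), math.log(0.4)]
teacher = [math.log(0.5), math.log(0.6), math.log(0.1)]
signal = opsd_token_signal(student, teacher)  # [0.0, +log(3), -log(4)]
```

Pure OPSD trains directly on this difference, and that is exactly where the circularity enters: both arrays come from the same weights, so nothing in the signal is anchored outside the model.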


Figure: OPSD vs. RLSD during training on multimodal reasoning tasks. OPSD reaches its peak performance early and then degrades as information leakage accumulates, whereas RLSD inherits the training stability of GRPO while achieving a higher convergence ceiling.

The model found itself in a feedback loop. When the student and teacher versions of the model were already similar in some way, the supervision signal reinforced that similarity, even if both versions were making identical mistakes. The information flowing through training wasn't anchored to external truth; it was circular, bouncing between different forward passes of the same model without ever touching ground reality.

The leakage trap

The technical name for this problem is information leakage, and it's the core reason why naive self-distillation destabilizes training.

When an external teacher corrects a student, the correction is grounded in real knowledge. If the student wrote "2 + 2 = 5" and the teacher responds with "2 + 2 = 4," the student learns something about the world. But when a model supervises itself, the teacher's knowledge is the student's knowledge, just retrieved differently. If both the student and teacher version of the model believe "2 + 2 = 5" (perhaps because they've been trained together), the self-supervision reinforces this error rather than correcting it.

More precisely, the model's representations drift over time. The token-level policy differences that the teacher generates become increasingly correlated with the student's own errors, because they share the same underlying weights. The supervision signal, which should be pulling the model toward better behavior, ends up pulling it toward more sophisticated versions of its existing mistakes.

This manifests in measurable ways. When researchers tracked the KL divergence between the student and teacher in OPSD—essentially measuring how different the two versions actually were during training—the signal became erratic. Rather than converging to stable behavior, the models oscillated in ways that looked like learning progress on the training objective but revealed brittle, self-reinforcing solutions.
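The monitoring described here reduces to a KL divergence between the two passes' next-token distributions at each position. A minimal sketch, using toy categorical distributions rather than real model outputs:

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two categorical distributions, e.g. the
    student's and teacher's next-token distributions at one position."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Identical policies give zero divergence; any gap is non-negative.
# A training monitor can flag runs where this value grows erratically.
same = kl_divergence([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])
gap = kl_divergence([0.7, 0.2, 0.1], [0.4, 0.4, 0.2])
```

In a healthy run this quantity settles; in the OPSD runs described above it oscillated and diverged, which is the measurable signature of the leakage.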


Figure: Leakage, KL divergence, and validation performance across OPSD and ablated variants. KL divergence diverges unpredictably and validation performance degrades despite early gains, showing training instability across multiple metrics.

Using two signals for different purposes

The paper's central insight is deceptively simple: RLVR and self-distillation fail in opposite ways. RLVR provides reliable directional feedback but only rarely, making it slow to learn when correct outcomes are infrequent. Self-distillation provides dense feedback but without external grounding, causing information to leak back into itself. What if each signal fixed the other's weakness?

The solution is RLSD (RLVR with Self-Distillation), which separates the roles of the two training signals.

Let the environment determine which direction the model should move. Does the model's response solve the problem correctly or not? This binary feedback is sparse but anchored to reality. It prevents the model from drifting into self-consistent but wrong behaviors.

Let self-distillation determine how far to move. Given that a correct answer exists, how different would the model's response be if it knew the answer? Measure this token-level difference, but use it only as a magnitude scaling factor for the policy gradient updates, not as a direct learning target. The model isn't trying to imitate itself; it's using self-imitation to calibrate how confidently to apply the environmental feedback.


Figure: Overview of the RLSD method. RLVR provides the directional signal and self-distillation provides the magnitude weighting, avoiding information leakage while capturing the benefits of both approaches.

The mechanism is elegant in its specificity. RLSD computes token-level policy differences between the student and teacher, then uses these differences to scale the policy gradient updates. Positions where the student and teacher most strongly disagree receive the largest magnitude adjustments. Positions where they agree receive minimal updates. This focuses learning effort on exactly the decisions that matter most, the ones where the model's current understanding is furthest from what it would generate with perfect information.
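As a rough sketch of that weighting scheme (illustrative only, not the paper's actual objective), the per-token update weight can be written as the verifiable reward times the absolute student-teacher disagreement:

```python
def rlsd_token_weights(student_logprobs, teacher_logprobs, reward):
    """Direction comes from the verifiable reward (+1 correct, -1
    incorrect); magnitude comes from the absolute token-level gap
    between the student and answer-conditioned (teacher) passes.
    Tokens where the two passes agree receive near-zero weight."""
    return [reward * abs(t - s)
            for s, t in zip(student_logprobs, teacher_logprobs)]

# Correct trajectory (reward +1): the most-disputed token gets the
# largest positive weight; the agreed-on first token gets almost none.
weights = rlsd_token_weights([-0.7, -1.6, -0.9], [-0.7, -0.5, -2.3], +1)
```

Note that the self-comparison never sets the sign of an update here, only its size; on an incorrect trajectory (reward -1) the same disagreements become the largest downweights.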

This hybrid approach avoids leakage because the supervision signal is no longer circular. The direction comes from outside the model; the magnitude comes from the model's self-comparison. Neither signal dominates. Neither signal alone drives the training dynamics.

Where the improvement actually emerges

The empirical results validate that this separation of concerns works. RLSD maintains the training stability of RLVR (the environment keeps pulling it toward reality), but it achieves a higher convergence ceiling than pure RLVR alone. The dense guidance from self-distillation accelerates learning in the right regions of policy space without causing the collapse seen in pure OPSD.


Figure: Training dynamics on multimodal reasoning tasks, comparing RLSD against baselines across problem types. RLSD maintains stability while converging to higher performance than RLVR alone, particularly on complex multimodal reasoning tasks.

But the most revealing evidence comes from visualizing what the model actually learns at the token level. When researchers examined credit heatmaps for individual examples, a clear pattern emerged: on correct trajectories, RLSD concentrates learning on the decisive steps. In a multimodal reasoning task, this might be the counting step or the arithmetic operation. The model learns to be more confident about the steps that matter. On incorrect trajectories, the model learns to downweight high-confidence mistakes, the decisions that pulled the response off track.

This focused learning is exactly what the theory predicts. The model isn't just fitting a loss function; it's learning what matters.


Figure: Token-level credit heatmaps. In a correct trajectory (top), RLSD concentrates credit on the counting and subtraction steps; in an incorrect trajectory (bottom), credit concentrates on steps containing high-confidence errors, teaching the model to downweight them.

The broader lesson

This research connects to a wider recognition in the field that verifiable feedback structures matter. Work on verifiable reward chains and on positive-unlabeled learning with distillation has pointed toward the value of grounding learning in verifiable outcomes rather than assuming that dense supervision alone ensures progress.

What makes RLSD noteworthy is its specific insight about when each signal should be applied. The paper demonstrates that the most powerful approach isn't choosing between paradigms but understanding their complementary roles. RLVR is good at pointing toward reality. Self-distillation is good at providing dense guidance. When information is shared—when one signal tries to do both jobs—it spirals into leakage. When information is separated—each signal with its own clear role—they become partners rather than competitors.

This principle extends beyond this specific technique. When facing a choice between different training paradigms, the instinct is often to pick the winner. But the real insight frequently lies in asking: what is each approach uniquely good at? What failure mode does each one avoid? Sometimes the most powerful solutions come not from new techniques but from understanding how existing ones can be orchestrated to address each other's weaknesses. In the case of RLSD, that orchestration required only a conceptual reframing: separate concerns between direction and magnitude, anchor one to ground truth, and let the other provide calibration. The result is training that's both stable and efficient, combining the virtues of two seemingly incompatible approaches.


This is a Plain English Papers summary of a research paper called Self-Distilled RLVR. If you like these kinds of analyses, join AIModels.fyi or follow us on Twitter.


