Authors:
(1) Nathan Lambert, Allen Institute for AI;
(2) Roberto Calandra, TU Dresden.
Understanding Objective Mismatch
The objective mismatch in RLHF emerges from three broad causes: first, common practice in RL engineering dictates that as long as reward is increasing, the model is improving; second, the evaluation methods available for models trained with RLHF are often incomplete relative to their downstream use-cases; third, the assumption that the trained reward model is a suitable reward function for optimization. For these reasons, objective mismatch manifests as the unproven assumption that downstream evaluations will be correlated with the reward model score of the current policy.
Objective mismatch in RLHF results from the interactions between three different sub-components, rather than just the two (i.e., dynamics model and policy) in MBRL. It is a balance of the reward model training (the goal of obtaining a calibrated reward function), the policy training (the process of uniformly extracting information from the reward model), and the often bespoke evaluation techniques used for RLHF models (the process of fairly evaluating a multi-use model). The interface between each pair of these three components provides an axis for erroneous assumptions regarding the true optimization problem, but the links are not equally important for mitigation.
Reward model training ↔ policy model training Uniformly extracting the information from the reward model into the policy, while avoiding the reward hacking inherent to RL (Pan, Bhatia, & Steinhardt, 2022) that can result in overoptimization of reward models (Gao et al., 2022), is central to RLHF. A good reward model may not be one from which a policy can easily obtain high reward, but rather one that is well correlated with downstream evaluation metrics. Common practice in RLHF, especially with larger models where gradients are less stable, is to spend additional compute searching for “stable” training runs with increasing reward, which further increases the likelihood of mismatch.
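As a minimal illustration of this link, the sketch below monitors a policy's proxy reward against a downstream evaluation across checkpoints and flags the point where they diverge. Here `checkpoints` (pairs of step and policy with a `generate` method), `reward_score`, and `downstream_eval` are hypothetical stand-ins for the trained reward model and whatever task-level evaluation is in use, not components defined in this paper.

```python
def detect_overoptimization(checkpoints, eval_prompts, reward_score, downstream_eval, patience=3):
    """Flag the step where the proxy reward keeps rising but the downstream
    evaluation stops improving, a typical signature of objective mismatch.

    checkpoints: iterable of (step, policy) pairs, where policy.generate(prompt) -> str
    reward_score: (prompt, completion) -> float, the learned reward model (assumed interface)
    downstream_eval: (prompts, completions) -> float, a task-level metric (assumed interface)
    """
    best_eval, stalled = float("-inf"), 0
    for step, policy in checkpoints:
        completions = [policy.generate(p) for p in eval_prompts]
        proxy = sum(reward_score(p, c) for p, c in zip(eval_prompts, completions)) / len(eval_prompts)
        downstream = downstream_eval(eval_prompts, completions)
        print(f"step={step} proxy_reward={proxy:.3f} downstream_eval={downstream:.3f}")
        if downstream > best_eval:
            best_eval, stalled = downstream, 0
        else:
            stalled += 1
        if stalled >= patience:
            return step  # proxy reward may still be rising, but downstream quality has plateaued
    return None
```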
Reward model training ↔ evaluation tools While relatively little work and few resources exist for the study of state-of-the-art reward models, matching the reward signal they provide to the intended use-case of the final policy (via evaluation metrics) is central to solving the objective mismatch issue. Reward models are trained on aggregated datasets to maximize agreement with a held-out set of data, which in practice often peaks at only 60-75% accuracy (Bai et al., 2022; Ouyang et al., 2022). Given the complexity of the task encompassed in reward modeling, it is unlikely that the models converge to 100% accuracy, but studying the sources of this performance delta could indicate sources of mismatch. New tools are needed for evaluating reward models that better match their conceptual underpinnings, both as a representation of human values for solving the alignment problem (Leike et al., 2018) and as a practical realization of targets for optimization (Lambert, Gilbert, & Zick, 2023).
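The held-out agreement figure cited above corresponds to a simple pairwise accuracy over preference data; the sketch below makes that metric explicit, with `score` standing in for any reward model that maps a (prompt, completion) pair to a scalar.

```python
def pairwise_accuracy(score, preference_pairs):
    """Fraction of held-out pairs where the reward model ranks the human-chosen
    completion above the rejected one.

    preference_pairs: iterable of (prompt, chosen, rejected) triples.
    score: (prompt, completion) -> float, an assumed reward model interface.
    """
    correct, total = 0, 0
    for prompt, chosen, rejected in preference_pairs:
        correct += score(prompt, chosen) > score(prompt, rejected)
        total += 1
    return correct / max(total, 1)
```

Accuracies in the reported 60-75% range leave a substantial disagreement gap whose sources are worth diagnosing rather than treating reward modeling as a solved classification task.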
Policy model training ↔ evaluation tools The third link contributes the least to the emergence of mismatch, but is the easiest axis along which to visualize potential signs of objective mismatch. Directly matching RL training to additional evaluation metrics is technically challenging. In MBRL, such a solution could be achieved by using a differentiable simulator (R. Wei et al., 2023), but with the complexity of RLHF such solutions are less obtainable. Exploring regularization or calibration of training with respect to final evaluations is a viable research direction, but this area of study is best suited for visualizing signs of objective mismatch, as shown in Fig. 3.
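One minimal way to produce such a visualization, assuming per-checkpoint aggregates have already been logged, is to plot mean reward model score against a downstream evaluation metric and report their rank correlation; a weak or negative correlation across checkpoints is an observable sign of mismatch. This is a sketch, not the figure-generation code from the paper.

```python
from scipy.stats import spearmanr
import matplotlib.pyplot as plt

def plot_reward_vs_eval(rewards, eval_scores, steps):
    """Scatter per-checkpoint mean reward model score against a downstream metric.

    rewards, eval_scores: per-checkpoint aggregates collected during training.
    steps: checkpoint identifiers used to annotate the points.
    """
    rho, _ = spearmanr(rewards, eval_scores)
    fig, ax = plt.subplots()
    ax.scatter(rewards, eval_scores)
    for s, r, e in zip(steps, rewards, eval_scores):
        ax.annotate(str(s), (r, e))
    ax.set_xlabel("mean reward model score")
    ax.set_ylabel("downstream evaluation metric")
    ax.set_title(f"Spearman rho = {rho:.2f}")
    return fig
```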
There is already emerging research on many potential causes of mismatch in RLHF, yet further work can be inspired by solutions from the broader RL literature. Many of the solutions to objective mismatch in MBRL will not apply directly because MBRL has access to a true reward from the environment; for that reason, research is needed to understand the outputs of reward models. Here follows a series of investigations that should be expanded to mitigate objective mismatch:
• Reward model evaluation There are many axes along which a reward model is expected to behave in order to be a reasonable approximation of a reward function, but they are typically not studied. Reward models need to be assessed for consistency, robustness to adversarial attacks, calibration across distributions, and more, as discussed in Lambert, Gilbert, and Zick (2023). Understanding reward model performance is the foundation of solving the mismatch problem. Evaluating reward models will be an indirect but useful path to measuring the varied preference datasets used for open RLHF models.
• Reward model training methods In order to address the limitations of reward models surfaced by better evaluation techniques, numerous new training methods will be developed. Early research has already shown that reward model ensembles can help mitigate overoptimization (Coste et al., 2023); a minimal ensemble-scoring sketch follows this list. Further research is warranted to integrate techniques that have improved the performance of model-based RL algorithms, such as probabilistic loss functions for the dynamics models and planning (Chua, Calandra, McAllister, & Levine, 2018), calibrated probability estimates (Malik et al., 2019) when training the reward model as a classifier, and other solutions (R. Wei et al., 2023). Additionally, links should be explored between the reward models of inverse reinforcement learning (IRL) (Ng, Russell, et al., 2000), the subfield tasked with learning a reward function from agent behavior, and those of RLHF. Early research also shows that reformulating reward model training to better match the preference learning literature could improve performance (Knox et al., 2023).
• Reward model training datasets High-quality datasets are a bottleneck slowing progress in open RLHF research, given the large costs required to acquire them. A few datasets are available, but they are unproven in their ability to match the performance of the best models. The Stack Exchange preference dataset (Lambert, Tunstall, Rajani, & Thrush, 2023), the Stanford Preferences Dataset of Reddit content (Ethayarajh, Choi, & Swayamdipta, 2022), UltraFeedback synthetic preference data (Cui et al., 2023), WebGPT internet browsing (Nakano et al., 2021), learning to summarize (Stiennon et al., 2020), and the Anthropic HHH dataset (Askell et al., 2021) serve as a strong foundation for research, but further exploration is needed.
• Value-guided sampling techniques Increased compute can be spent at inference time to improve the performance of RLHF models by utilizing the values returned by the reward model (Deng & Raffel, 2023); a best-of-n sketch follows this list. Feng et al. (2023) explore this through Monte Carlo tree search generation, yet many more methods can be explored across the planning literature.
• Human-centric NLP evaluation The most popular evaluation technique for chat-tuned RLHF models is the preference percentage against other top models on evaluation prompt sets (as done for open RLHF models including Llama 2 (Touvron et al., 2023) and Dromedary-2 (Sun et al., 2023)). This evaluation mechanism, while well motivated by the popular use-cases of the models, suffers from bias and reproducibility challenges. The prompts can easily be chosen to favor the model designed by the authors, and they are often not released or aggregated into a future benchmark. Expanding the reproducibility and consistency of these practices will be important for creating robust RLHF practices.
• RL optimizers for language As discussed in Sec. 2.1, the optimizers used for RLHF are most often those from previous RL literature. There is now an opportunity to expand RL algorithms into the niche of RLHF, where conditions are highly specialized through the expansive action space and bandit formulation. New algorithms are a step in the right direction, such as T. Wu et al. (2023) modifying the PPO algorithm for pairwise preferences or Baheti et al. (2023) proposing an offline RL algorithm for full-completion actions.
• Other solutions Other creative mismatch solutions will exist, such as work integrating the LLM policy, reward model, and transition function into a single model (Xu, Dong, Arumugam, & Van Roy, 2023). Methods such as this need to be evaluated across many scales to confirm that they remain numerically stable with the larger state-of-the-art models where powerful emergent behaviors exist.
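As a concrete sketch of the ensembling direction mentioned in the list above (not the implementation of Coste et al., 2023), a conservative ensemble reward can average member scores while penalizing their disagreement, since high disagreement often marks regions where a single reward model is being exploited. Each ensemble member is assumed to map a (prompt, completion) pair to a scalar.

```python
import statistics

def ensemble_reward(members, prompt, completion, disagreement_penalty=1.0):
    """Conservative reward: mean of ensemble scores minus a penalty on their spread.

    members: list of reward models, each (prompt, completion) -> float (assumed interface).
    disagreement_penalty: how strongly to discount completions the ensemble disagrees on.
    """
    scores = [m(prompt, completion) for m in members]
    return statistics.mean(scores) - disagreement_penalty * statistics.pstdev(scores)
```

The penalty coefficient trades off conservatism against reward signal strength; setting it to zero recovers plain mean-ensemble scoring.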
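Similarly, the simplest value-guided sampling baseline referenced in the list is best-of-n reranking with the reward model; tree-search approaches such as Feng et al. (2023) replace this single rerank step with value-guided expansion of partial completions. `policy_sample` and `reward` are assumed interfaces, not a specific library API.

```python
def best_of_n(policy_sample, reward, prompt, n=16):
    """Draw n candidate completions and return the one the reward model prefers.

    policy_sample: prompt -> str, a stochastic sample from the RLHF policy (assumed interface).
    reward: (prompt, completion) -> float, the reward model used for reranking.
    """
    candidates = [policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))
```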
The original training objective used in popular language model architectures, autoregressive next-token prediction, also suffers from an objective mismatch problem, as almost all LLM evaluation techniques evaluate the entire output rather than individual tokens. While this is true, the development signal that the next-token prediction loss provides is more orthogonal to the goals of RLHF. In RLHF, and in most related work in RL, the reward signal is interpreted as a direct indicator of performance. This assumption unintentionally creates a much more nuanced research setup, warranting specific study of its impacts.
In MBRL, the learning of a dynamics model is also often done via one-step transitions, with recent work studying autoregressive models (Janner, Li, & Levine, 2021; Lambert, Wilcox, Zhang, Pister, & Calandra, 2021), where the compounding error of multiple one-step predictions is a well-known issue deeply related to objective mismatch (Lambert, Pister, & Calandra, 2022). In the case where mismatch becomes a fundamental problem for LLMs, similar solutions could be investigated.
This paper is available on arxiv under CC 4.0 license.