Authors:
(1) Nathan Lambert, Allen Institute for AI;
(2) Roberto Calandra, TU Dresden.
Early work in RLHF focused on continuous control domains, with various methods for altering behavior across trajectories (Christiano et al., 2017; Wirth, Akrour, Neumann, Fürnkranz, et al., 2017). Today, the impact of RLHF has centered on its use with LLMs. Initial work on RLHF for LLMs used human preferences over batches of 4 options (Ziegler et al., 2019) to train a reward model evaluated across general LLM benchmarks. Group preferences were then replaced with pairwise preferences, and rather than general benchmarks, the reward model was focused on the task of summarization (Stiennon et al., 2020; J. Wu et al., 2021). Next emerged general question-answering models (Ouyang et al., 2022) and web-browsing agents (Nakano et al., 2021), built primarily by scaling the initial models and human-preference datasets. Now, RLHF is used to train general chat models across a variety of tasks (Bai et al., 2022; Schulman et al., 2022; Touvron et al., 2023) and in specific domains such as harm reduction (Glaese et al., 2022) or information accuracy (Menick et al., 2022).
The development of these methods has accelerated markedly, with many variations on how feedback is integrated into language models (Fernandes et al., 2023). The most popular reinforcement learning optimizer is still Proximal Policy Optimization (PPO) (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017), with its many stable and scalable implementations. Recent works have proposed alternatives for the RL optimizer, such as the offline RL algorithm Implicit Language Q-Learning (ILQL) (Snell, Kostrikov, Su, Yang, & Levine, 2022), Direct Preference Optimization (DPO) (Rafailov et al., 2023), which uses preference data without an explicit reward model, and Advantage-Leftover Lunch RL (A-LOL) (Baheti et al., 2023), which treats the entire response as a single action (as RLHF generally does).
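For reference, the DPO objective (in our notation, following Rafailov et al., 2023) folds the reward model into a log-ratio against a frozen reference policy and optimizes the policy directly from pairwise preferences:

\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right],
\]

where \(y_w\) and \(y_l\) are the preferred and dispreferred responses to prompt \(x\), \(\pi_{\text{ref}}\) is the reference policy, \(\beta\) is a temperature, and \(\sigma\) is the logistic function.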
There is a substantial emerging literature on numerical issues, unexpected behaviors such as verbosity and evasiveness (Schulman, 2023), and potential solutions in RLHF, many of which could be mitigated by progress on solving objective mismatch. A prominent recent example is the flagship Llama 2 chat model refusing to answer the request “How do I kill a Linux process,” conflating the computer process with the morals of killing a living creature. It has been shown that reward model over-optimization follows predictable trends under both PPO and best-of-N optimization (Gao, Schulman, & Hilton, 2022), which can be partially mitigated by training ensemble reward models (Coste, Anwar, Kirk, & Krueger, 2023). Other issues that demonstrate the need for improved reward models have emerged in RLHF models, such as a bias towards long responses (Singhal, Goyal, Xu, & Durrett, 2023), a lack of language consistency, i.e., invariance over changes that maintain meaning (Shen et al., 2023), and a reduction of output diversity (Kirk et al., 2023). A similar argument is made by A. Wei, Haghtalab, and Steinhardt (2023), who argue that “competing objectives and mismatched generalization” mislead the models; we present how objective mismatch covers both of these limitations and more possible failure cases.
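As an illustrative sketch rather than the exact setup of the cited works, best-of-N sampling with a conservative ensemble score can be written as follows; `generate` and the entries of `reward_models` are hypothetical callables standing in for an LLM sampler and trained reward models:

```python
def ensemble_reward(prompt, response, reward_models):
    # Conservative aggregate: score with every reward model in the ensemble
    # and keep the minimum, so a single over-optimistic model cannot dominate.
    return min(rm(prompt, response) for rm in reward_models)


def best_of_n(prompt, generate, reward_models, n=16):
    # Best-of-N: sample n candidate completions and return the one the
    # ensemble reward scores highest.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: ensemble_reward(prompt, c, reward_models))
```

Other aggregates, such as the ensemble mean or the mean penalized by the ensemble's standard deviation, are natural variants of the same conservative idea.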
There are also papers studying how the reward model training formulation does not align with the data collection process or the downstream RL optimization, suggesting that reward models should predict advantage estimates rather than direct value functions (Knox & Stone, 2008; Peng et al., 2023).
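For context, reward models in RLHF are typically trained with a Bradley-Terry style pairwise loss (our notation):

\[
\mathcal{L}(r_\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right].
\]

The critique in these works is that this maximum-likelihood objective over pairs does not directly correspond to the advantage-like quantity the downstream policy optimizer actually needs.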
Core to the idea of objective mismatch with LLMs are the methods of evaluation used to correlate performance. Historically, LLMs have been evaluated across a wide variety of tasks that attempt to capture specific characteristics of models, making evaluation an extremely broad process (Liang et al., 2022) in which progress is saturating (Kiela, Thrush, Ethayarajh, & Singh, 2023). Now, many models are focused on hard-to-specify tasks such as chat, where existing benchmarks were not well correlated with performance (Zheng et al., 2023). New chat-based evaluations such as MT-Bench (Zheng et al., 2023) and AlpacaEval (Li et al., 2023) have been introduced, but substantial further work is needed.
This paper is available on arxiv under CC 4.0 license.