Authors:
(1) Nathan Lambert, Allen Institute for AI;
(2) Roberto Calandra, TU Dresden.
Iterative deployment of RLHF The iterative deployment form of RLHF, in which the reward model is retrained on user data and thereby induces a second feedback loop, is shown in Fig. 4. Schulman (2023) discusses how this process is used in ChatGPT to mitigate issues such as evasiveness, verbosity, and other unexpected, undesirable qualities. Designing within this framework adds further complexity to the engineering objectives, but it allows mismatch to be mitigated iteratively. This style of iterative RL deployment has been characterized as exogenous feedback (Gilbert, Dean, Zick, & Lambert, 2022) and can have societal implications.
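To make the structure of this outer loop concrete, below is a minimal, self-contained sketch in Python. Every component (the canned responses, the dict-valued reward model, the simulated user, and the greedy "policy" update) is a toy stand-in invented for illustration, not the systems discussed above; the point is only to show how retraining the reward model on logged user data wraps a second feedback loop around the usual RLHF optimization.

```python
import random

# Toy sketch of the outer deployment loop (cf. Fig. 4): deploy a policy,
# collect pairwise user feedback, refit the reward model, re-optimize.
# All components are stand-ins for illustration only.

RESPONSES = ["concise", "verbose", "evasive"]
TRUE_USER_PREFERENCE = {"concise": 1.0, "verbose": 0.4, "evasive": 0.0}  # hidden from the learner

def deploy_and_collect(policy_choice, n=200):
    """Deployment step: log comparisons between the deployed policy's
    response and a random alternative, judged by a simulated user.
    The logged data depends on the deployed policy -- the second loop."""
    logs = []
    for _ in range(n):
        a = policy_choice
        b = random.choice([r for r in RESPONSES if r != a])
        winner = a if TRUE_USER_PREFERENCE[a] >= TRUE_USER_PREFERENCE[b] else b
        logs.append((a, b, winner))
    return logs

def retrain_reward_model(logs):
    """Reward-model step: refit scores as empirical win rates on new logs."""
    wins = {r: 0 for r in RESPONSES}
    counts = {r: 0 for r in RESPONSES}
    for a, b, winner in logs:
        counts[a] += 1
        counts[b] += 1
        wins[winner] += 1
    return {r: wins[r] / counts[r] if counts[r] else 0.0 for r in RESPONSES}

policy_choice = random.choice(RESPONSES)        # arbitrary initial policy
for cycle in range(3):                          # outer loop: deploy -> retrain -> re-optimize
    logs = deploy_and_collect(policy_choice)
    reward_model = retrain_reward_model(logs)
    policy_choice = max(RESPONSES, key=reward_model.get)  # greedy stand-in for the RL step
    print(f"cycle {cycle}: deployed policy now prefers '{policy_choice}'")
```

Even in this toy, the data the reward model sees in each cycle is shaped by the previously deployed policy, which is exactly the coupling that makes iterative deployment both a mitigation tool and a source of added complexity.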
There is some literature in this space, but extending it to the scale at which modern LLMs are used will be difficult. For example, Suhr and Artzi (2022) show theoretical results on outer-loop optimization of instruction-tuned models.
Contextual bandits The modifications made to the RL optimization in RLHF cast it as a contextual bandits problem, in which the agent takes a single action and the dynamics are abstracted into a single trajectory-reward pairing. Work in this area has investigated the potential of integrating partial, skewed, or noisy human feedback into the optimization process (Nguyen, Daumé III, & Boyd-Graber, 2017).
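The following sketch illustrates this single-step framing: the context is a prompt, the one action is an entire completion, and the episode ends immediately with a scalar score from a reward model. The tabular softmax policy, the lookup-table reward model, and the REINFORCE-style update are simplified stand-ins chosen for brevity, not the setup used in practice (where the policy is an LLM optimized with PPO-like methods).

```python
import math
import random

# Contextual-bandit view of RLHF: one context (prompt), one action
# (a whole completion), one scalar reward, no further dynamics.

PROMPTS = ["summarize", "translate"]
COMPLETIONS = ["good answer", "okay answer", "bad answer"]
REWARD_MODEL = {  # toy stand-in for a learned reward model r(prompt, completion)
    ("summarize", "good answer"): 1.0, ("summarize", "okay answer"): 0.3,
    ("summarize", "bad answer"): -1.0, ("translate", "good answer"): 0.8,
    ("translate", "okay answer"): 0.1, ("translate", "bad answer"): -0.5,
}

# One logit per (context, action) pair defines the policy pi(a | s).
logits = {(p, c): 0.0 for p in PROMPTS for c in COMPLETIONS}

def policy_probs(prompt):
    exps = {c: math.exp(logits[(prompt, c)]) for c in COMPLETIONS}
    z = sum(exps.values())
    return {c: v / z for c, v in exps.items()}

def reinforce_step(lr=0.5):
    """One bandit interaction: sample a context and a single action,
    observe one reward, and apply a REINFORCE policy-gradient update."""
    prompt = random.choice(PROMPTS)
    probs = policy_probs(prompt)
    action = random.choices(COMPLETIONS, weights=probs.values())[0]
    reward = REWARD_MODEL[(prompt, action)]   # the whole "trajectory" is one action
    for c in COMPLETIONS:                     # grad of log pi = indicator - prob
        grad = (1.0 if c == action else 0.0) - probs[c]
        logits[(prompt, c)] += lr * reward * grad

for _ in range(500):
    reinforce_step()
for p in PROMPTS:
    probs = policy_probs(p)
    print(p, "->", max(probs, key=probs.get))
```

The absence of any state transition after the action is what distinguishes this formulation from the full RL setting and connects it to the bandit literature cited above.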
The subarea of dueling bandits further specifies a problem setting closely aligned with RLHF, though primarily in theoretical work with much smaller models, datasets, and tasks. Yue, Broder, Kleinberg, and Joachims (2012) explain this space in work showing theoretical bounds:
“In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure... but where pairwise comparisons are easy to make.”
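The interaction loop this quote describes can be sketched compactly. In the Python below, the learner never observes absolute rewards, only noisy binary outcomes of pairwise comparisons; the explore-then-commit strategy (duel random pairs, then commit to the arm with the best average win rate) is an illustrative baseline, not the Interleaved Filter algorithm that Yue et al. (2012) analyze, and the hidden utilities exist only to simulate the comparisons.

```python
import random

# Dueling-bandit feedback model: only (noisy) relative feedback is observed.

ARMS = ["A", "B", "C", "D"]
TRUE_UTILITY = {"A": 0.1, "B": 0.9, "C": 0.4, "D": 0.6}  # never shown to the learner

def duel(i, j):
    """Noisy binary feedback: i beats j with probability set by the utility gap."""
    p_i_wins = TRUE_UTILITY[i] / (TRUE_UTILITY[i] + TRUE_UTILITY[j])
    return i if random.random() < p_i_wins else j

wins = {a: 0 for a in ARMS}
comparisons = {a: 0 for a in ARMS}

for _ in range(3000):                       # exploration: pairwise comparisons only
    i, j = random.sample(ARMS, 2)
    winner = duel(i, j)
    wins[winner] += 1
    comparisons[i] += 1
    comparisons[j] += 1

# Commit to the arm with the highest empirical win rate (a Borda-style winner).
best = max(ARMS, key=lambda a: wins[a] / comparisons[a])
print("estimated best arm:", best,
      {a: round(wins[a] / comparisons[a], 2) for a in ARMS})
```

The parallel to RLHF is that the pairwise comparison plays the role of a human preference label, while the absolute utilities, like the "true" quality of a model completion, are never directly measurable.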
This setting, while closely related to RLHF, will require substantial experimentation to become applicable. Others have built on this line of work to learn directly from human preferences (Sekhari, Sridharan, Sun, & Wu, 2023) or from implicit human feedback (Maghakian et al., 2022).
This paper is available on arXiv under a CC 4.0 license.