
The Iterative Deployment of RLHF in Language Models


Too Long; Didn't Read

Delve into the complexities of RLHF's iterative deployment, mitigating undesirable language model qualities through exogenous feedback. Explore the societal implications and engineering challenges of this approach. Uncover the theoretical alignment of RLHF with contextual bandits, paving the way for potential real-world applications.

Authors:

(1) Nathan Lambert, Allen Institute for AI;

(2) Roberto Calandra, TU Dresden.

Table of Links

Abstract & Introduction

Related Work

Background

Understanding Objective Mismatch

Discussions

Conclusion

Acknowledgments, and References

5 Discussions

Iterative deployment of RLHF The iterative deployment form of RLHF, in which reward models are retrained on user data and thereby induce a second feedback loop, is shown in Fig. 4. Schulman (2023) discusses how this is used in ChatGPT to mitigate issues such as evasiveness, verbosity, and other unexpected, undesirable qualities. Designing within this framework adds further complexity to the engineering objectives, but allows mismatch to be mitigated iteratively. This style of iterative RL deployment has been understood as exogenous feedback (Gilbert, Dean, Zick, & Lambert, 2022) and can have societal implications.
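A minimal sketch of this outer loop is shown below, assuming placeholder functions for data collection, reward-model training, and RL fine-tuning; none of the names correspond to a real deployment pipeline.

```python
# Hedged sketch of the second (outer) feedback loop in iterative RLHF deployment.
# All functions are illustrative stubs, not an actual training or deployment API.

def collect_user_feedback(policy):
    """Gather fresh preference data from interactions with the deployed model (stub)."""
    return []

def train_reward_model(preference_data):
    """Retrain the reward model on newly collected user preferences (stub)."""
    return lambda prompt, completion: 0.0

def run_rlhf(policy, reward_model):
    """Inner loop: RL fine-tuning of the policy against the current reward model (stub)."""
    return policy

def iterative_deployment(policy, num_rounds: int = 3):
    """Outer loop: each deployed policy generates the user data that trains the next reward model."""
    for _ in range(num_rounds):
        preferences = collect_user_feedback(policy)    # exogenous feedback from real usage
        reward_model = train_reward_model(preferences) # target issues surfaced by users (evasiveness, verbosity, ...)
        policy = run_rlhf(policy, reward_model)
    return policy
```

The point of the sketch is only the structure: the deployed policy produces the data that shapes the next reward model, which is what makes the mismatch iteratively correctable and also what introduces the second feedback loop.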


There is some literature in this space, but extending it to the scale at which modern LLMs are used will be difficult. For example, Suhr and Artzi (2022) show theoretical results on outer-loop optimization of instruction-tuned models.


Contextual bandits The modifications made to the RL optimization of RLHF cast it as a contextual bandit problem, in which the agent takes a single action and the dynamics are abstracted into one trajectory-reward pair. Work in this area has investigated the potential of integrating partial, skewed, or noisy human feedback into the optimization process (Nguyen, Daumé III, & Boyd-Graber, 2017).
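To make that framing concrete, the sketch below treats the prompt as the bandit context, a full generated completion as a single action, and the reward model's scalar score as the only feedback; the `generate` and `reward_model` callables are assumed stand-ins for a policy and a learned reward model, not any particular library API.

```python
# Illustrative sketch of RLHF as a contextual bandit: one context (prompt),
# one action (a full completion), one scalar reward per trajectory.

from typing import Callable, List, Tuple

def collect_bandit_batch(
    prompts: List[str],
    generate: Callable[[str], str],             # policy: context -> a single action (one completion)
    reward_model: Callable[[str, str], float],  # scores the whole trajectory at once
) -> List[Tuple[str, str, float]]:
    """Each episode is a single (context, action, reward) triple; there are no
    intermediate environment states or per-step rewards, which is what collapses
    the RL problem into a contextual bandit."""
    batch = []
    for prompt in prompts:
        completion = generate(prompt)              # one action per context
        reward = reward_model(prompt, completion)  # one reward per trajectory
        batch.append((prompt, completion, reward))
    return batch

# Toy usage with dummy stand-ins:
if __name__ == "__main__":
    toy_generate = lambda prompt: prompt + " ... a sampled completion"
    toy_reward = lambda prompt, completion: float(len(completion) % 5)  # dummy scalar score
    print(collect_bandit_batch(["Explain RLHF in one line."], toy_generate, toy_reward))
```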


Figure 4: The outer loop of RLHF is the process of training the next reward model for RL to improve performance in areas of user interest. This setup induces additional complexity in objective mismatch in RLHF.


The subarea of dueling bandits has further specified a problem formulation that is closely aligned with RLHF, though primarily in theoretical work with much smaller models, datasets, and tasks. Yue, Broder, Kleinberg, and Joachims (2012) explain this space in work showing theoretical bounds:


“In contrast to conventional approaches that require the absolute reward of the chosen strategy to be quantifiable and observable, our setting assumes only that (noisy) binary feedback about the relative reward of two chosen strategies is available. This type of relative feedback is particularly appropriate in applications where absolute rewards have no natural scale or are difficult to measure... but where pairwise comparisons are easy to make.”


This setting, while closely related to RLHF, will require substantial experimentation to become applicable. Others have built on it in work that learns directly from human preferences (Sekhari, Sridharan, Sun, & Wu, 2023) or from implicit human feedback (Maghakian et al., 2022).
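As one concrete illustration of learning from relative rather than absolute feedback, the sketch below computes a Bradley-Terry style pairwise loss over (chosen, rejected) score pairs, the kind of objective commonly used to fit RLHF reward models from binary comparisons; the tensor names and shapes are illustrative assumptions, not an implementation from any of the works cited above.

```python
# Hedged sketch: a Bradley-Terry style pairwise loss over (chosen, rejected) pairs,
# turning binary "relative reward" feedback into a training signal for a reward model.
# Shapes and names here are illustrative assumptions.

import torch
import torch.nn.functional as F

def pairwise_preference_loss(
    chosen_scores: torch.Tensor,    # reward-model scores for preferred completions, shape (batch,)
    rejected_scores: torch.Tensor,  # reward-model scores for dispreferred completions, shape (batch,)
) -> torch.Tensor:
    """Minimize -log sigmoid(r_chosen - r_rejected): only the *difference* in scores
    matters, so no absolute reward scale is needed, matching the dueling-bandit setting."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage:
if __name__ == "__main__":
    chosen = torch.tensor([1.2, 0.3, 2.0])
    rejected = torch.tensor([0.4, 0.5, 1.1])
    print(pairwise_preference_loss(chosen, rejected))  # small when chosen scores exceed rejected scores
```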


This paper is available on arXiv under a CC 4.0 license.