
Related Work on Reinforcement Learning from Human Feedback


Too Long; Didn't Read

The article explores the evolution of RLHF, from its early applications in continuous control domains to its pivotal role in training large language models for complex tasks. It delves into challenges like problem mis-specification, showcasing unpredictable behaviors in language models and proposing potential solutions. Additionally, it discusses the evolving landscape of evaluating RLHF-trained models, emphasizing the need for specialized benchmarks in tasks such as chat for more accurate assessments.

Authors:

(1) Nathan Lambert, Allen Institute for AI;

(2) Roberto Calandra, TU Dresden.

Abstract & Introduction

Related Work

Background

Understanding Objective Mismatch

Discussions

Conclusion

Acknowledgments, and References

2.1 Reinforcement learning from human feedback

Early work in RLHF focused on continuous control domains with various methods for altering behavior across trajectories (Christiano et al., 2017; Wirth, Akrour, Neumann, Fürnkranz, et al., 2017). Today, the impact of RLHF has primarily centered on its use with LLMs. Initial work on RLHF for LLMs used user preferences over a batch of four options (Ziegler et al., 2019) to train a reward model evaluated across general LLM benchmarks. Group preferences were later replaced with pairwise preferences, and rather than general benchmarks, the reward model was focused on the task of summarization (Stiennon et al., 2020; J. Wu et al., 2021). Next came general question-answering models (Ouyang et al., 2022) and web crawling agents (Nakano et al., 2021), primarily by scaling the initial models and human datasets. Now, RLHF is used to train general chat models across a variety of tasks (Bai et al., 2022; Schulman et al., 2022; Touvron et al., 2023) and in specific domains such as harm reduction (Glaese et al., 2022) or information accuracy (Menick et al., 2022).
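For readers unfamiliar with how such pairwise preferences are typically turned into a reward model, the sketch below shows a standard Bradley-Terry style loss over chosen/rejected response pairs. It is a minimal illustration, not the exact training setup of any cited work; the `reward_model` callable and argument names are assumptions.

```python
import torch.nn.functional as F

def pairwise_reward_loss(reward_model, chosen_inputs, rejected_inputs):
    """Bradley-Terry style loss: the reward of the human-chosen response
    should exceed the reward of the rejected one.

    `reward_model` is assumed to map a batch of tokenized responses to one
    scalar reward per example; the names here are illustrative only.
    """
    r_chosen = reward_model(chosen_inputs)      # shape: (batch,)
    r_rejected = reward_model(rejected_inputs)  # shape: (batch,)
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```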


The development of these methods has accelerated markedly, with many variations on how feedback is integrated into language models (Fernandes et al., 2023). The most popular reinforcement learning optimizer is still Proximal Policy Optimization (PPO) (Schulman, Wolski, Dhariwal, Radford, & Klimov, 2017), which has many stable and scalable implementations. Recent works have developed alternative optimizers, such as the offline RL algorithm Implicit Language Q-Learning (ILQL) (Snell, Kostrikov, Su, Yang, & Levine, 2022), direct preference optimization (DPO) (Rafailov et al., 2023), which uses preference data without a reward model, and Advantage-Leftover Lunch RL (A-LOL) (Baheti et al., 2023), which is designed to act on the entire response as a single action (as RLHF generally does).
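As intuition for how DPO sidesteps an explicit reward model, below is a minimal sketch of its loss, assuming precomputed summed log-probabilities of each full response under the policy and a frozen reference model; the variable names and the default beta are illustrative assumptions.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Sketch of the direct preference optimization loss (Rafailov et al., 2023).

    Each argument is the summed log-probability of a full response under the
    policy or the frozen reference model; beta scales the implicit reward.
    """
    # Implicit rewards are log-probability ratios against the reference model.
    chosen_reward = beta * (policy_logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (policy_logp_rejected - ref_logp_rejected)
    # Same Bradley-Terry form as reward-model training, but applied directly
    # to the policy, so no separate reward model is needed.
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```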

2.2 Problem mis-specification in RLHF

There is a substantial emerging literature on varieties of numerical issues, unexpected behaviors such as verbosity and evasiveness (Schulman, 2023), and potential solutions in RLHF, all of which could be mitigated by progress on solving objective mismatch. A prominent recent example is the flagship Llama 2 chat model refusing to answer the request “How do I kill a Linux process,” conflating the computer process with the morals of killing a living creature. It has been shown that reward model overoptimization follows predictable patterns under PPO and best-of-N optimization (Gao, Schulman, & Hilton, 2022), which can be partially mitigated by training ensembles of reward models (Coste, Anwar, Kirk, & Krueger, 2023). Other issues have emerged in RLHF models that demonstrate the need for improved reward models, such as a bias towards long responses (Singhal, Goyal, Xu, & Durrett, 2023), a lack of language consistency, i.e., invariance over changes that preserve meaning (Shen et al., 2023), and a reduction of output diversity (Kirk et al., 2023). A similar argument is made by A. Wei, Haghtalab, and Steinhardt (2023), who argue that “competing objectives and mismatched generalization” mislead the models; we present how objective mismatch covers both of these limitations and more possible failure cases.
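To make the best-of-N and ensembling ideas concrete, here is a minimal sketch of best-of-N sampling scored by an ensemble of reward models, using a conservative minimum-score aggregation as one stand-in for the mitigations studied in the cited work; the `generate` and `reward_models` callables are assumptions for illustration, not the exact procedures of those papers.

```python
def best_of_n(prompt, generate, reward_models, n=16):
    """Sample n candidate responses and return the one the reward ensemble
    scores highest.

    `generate(prompt)` is assumed to return a single sampled response, and
    `reward_models` is a list of callables mapping (prompt, response) to a
    scalar; both are illustrative stand-ins.
    """
    candidates = [generate(prompt) for _ in range(n)]

    def ensemble_score(response):
        scores = [rm(prompt, response) for rm in reward_models]
        # Conservative aggregation: a response only scores well if every
        # ensemble member agrees, which dampens overoptimization of any
        # single reward model's quirks.
        return min(scores)

    return max(candidates, key=ensemble_score)
```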


There are also papers studying how the reward model training formulation does not align with the data collection process and downstream RL optimization, suggesting that reward models should estimate advantages rather than direct value functions (Knox & Stone, 2008; Peng et al., 2023).

2.3 Evaluating LLMs trained with RLHF

Core to the ideas of objective mismatch with LLMs are the evaluation methods used to correlate performance. Historically, LLMs have been evaluated across a wide variety of tasks intended to capture specific characteristics of models, making evaluation an extremely broad process (Liang et al., 2022) in which progress is saturating (Kiela, Thrush, Ethayarajh, & Singh, 2023). Now, many models are focused on hard-to-specify tasks such as chat, where existing benchmarks were not well correlated with performance (Zheng et al., 2023), so new chat-based evaluations such as MT-Bench (Zheng et al., 2023) and AlpacaEval (Li et al., 2023) have been introduced, but substantial further work is needed.
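As a rough illustration of how chat-oriented evaluations of this kind typically summarize performance, the sketch below computes a head-to-head win rate from per-prompt judgments against a baseline model; the judgment format and tie-handling convention are assumptions, not the exact protocols of MT-Bench or AlpacaEval.

```python
def win_rate(judgments):
    """Compute a win rate from per-prompt judgments against a baseline.

    `judgments` is assumed to be a list of strings in {"win", "tie", "loss"},
    one per evaluation prompt; ties count as half a win, a common convention
    in pairwise evaluation.
    """
    if not judgments:
        raise ValueError("no judgments provided")
    score = sum(1.0 if j == "win" else 0.5 if j == "tie" else 0.0
                for j in judgments)
    return score / len(judgments)
```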


This paper is available on arxiv under CC 4.0 license.