
The Mechanics of Reward Models in RLHF


Too Long; Didn't Read

Delve into the mechanics of training reward models in RLHF for language models, where human preference data guides the classification of optimal responses. Understand the intricacies of feedback, from group selections to pairwise choices, shaping the scalar output for each text piece. Explore how reinforcement learning on language transforms the generating model into a policy model, creating a contextual bandits scenario for improved language generation.

Authors:

(1) Nathan Lambert, Allen Institute for AI;

(2) Roberto Calandra, TU Dresden.

Table of Links

Abstract & Introduction

Related Work

Background

Understanding Objective Mismatch

Discussions

Conclusion

Acknowledgments, and References

3 Background

3.1 Reward model training

Reward models are trained on human preference data, most often consisting of a task given to the model as a prompt, i.e., a request or instruction, along with ratings of the completion, or answer. The feedback can consist of selecting the best completion from a group of responses (Ziegler et al., 2019), scores and rankings of a group of candidate responses (Ouyang et al., 2022), a choice between a pair of responses (Bai et al., 2022), and even finer-grained data (Z. Wu et al., 2023). The workers employed are generally given detailed instructions on which styles, occurrences, or values to prioritize in their labels.
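For concreteness, a single pairwise preference record might look like the following sketch. The field names are hypothetical and real datasets differ in structure and granularity:

```python
# Hypothetical pairwise preference record; field names are illustrative only.
preference_record = {
    "prompt": "Explain why the sky is blue in one sentence.",
    "chosen": "Sunlight scatters off air molecules, and shorter blue wavelengths scatter most strongly.",
    "rejected": "The sky is blue because it reflects the ocean.",
}
```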


The reward models trained for RLHF are most often trained as classifiers between a chosen and a rejected completion to a prompt before optimizing with RL, where they return a scalar value for each piece of text. Given a prompt x, two candidate completions y_chosen and y_rejected, and a scalar output r_θ(x, y) from an initially untrained value head on an LLM (or a separate value model entirely), the loss for the reward model follows (Askell et al., 2021; Ouyang et al., 2022):

L(θ) = −log σ( r_θ(x, y_chosen) − r_θ(x, y_rejected) )
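A minimal sketch of this objective, assuming a PyTorch setup where a value head has already produced scalar rewards for a batch of chosen and rejected completions (tensor names here are hypothetical):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch:
    pushes the chosen completion's scalar reward above the rejected one's."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of scalar rewards from the (initially untrained) value head
r_chosen = torch.tensor([1.2, 0.3, 0.8])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_reward_loss(r_chosen, r_rejected).item())
```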



Figure 2: Comparing RLHF to a traditional RL problem. (left) is the canonical RL problem, where an agent interacts repeatedly with an environment. (right) is RLHF, where an agent is optimized against a set of predetermined prompts.

3.2 Reinforcement Learning on Language


With LLMs, the generating model is referred to as the policy model. In RLHF, the discount factor of the reward is set to 1 and no further actions are taken for the given prompt, casting the problem as a contextual bandit. An example of the RLHF loop is shown in Fig. 2b, in comparison to the standard RL loop shown in Fig. 2a.
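A toy sketch of that contextual-bandit view (the policy and reward model below are placeholders, not the systems described in the paper): each prompt is a context, the full completion is a single action, and the scalar reward ends the episode immediately.

```python
import random

def policy_generate(prompt: str) -> str:
    # Placeholder policy: a real system samples from the LLM being optimized.
    return random.choice([f"Completion A for: {prompt}", f"Completion B for: {prompt}"])

def reward_model_score(prompt: str, completion: str) -> float:
    # Placeholder reward model: a real system scores text with the trained reward model.
    return float(len(completion) % 7)

def rlhf_bandit_iteration(prompts):
    """One pass of the RLHF loop: one action per context, discount factor 1, no further steps."""
    batch = []
    for prompt in prompts:
        completion = policy_generate(prompt)             # the single "action" for this context
        reward = reward_model_score(prompt, completion)  # scalar reward for the whole text
        batch.append((prompt, completion, reward))       # consumed by an RL update such as PPO
    return batch

print(rlhf_bandit_iteration(["Write a haiku about autumn."]))
```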


This paper is available on arXiv under a CC 4.0 license.