Authors:
(1) Nathan Lambert, Allen Institute for AI;
(2) Roberto Calandra, TU Dresden.
Reward models are trained on human preference data, most often consisting of a prompt given to the model, i.e., a request or instruction, and ratings of the completion, or answer. The feedback can consist of selecting the best from groups of responses (Ziegler et al., 2019), scores and rankings of a group of candidate responses (Ouyang et al., 2022), a choice between a pair of responses (Bai et al., 2022), and even finer-grained data (Z. Wu et al., 2023). The workers employed are generally given detailed instructions on which styles, occurrences, or values to prioritize in their labels.
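To make the pairwise case concrete, the sketch below shows one way a single preference record might be represented; the field names and example strings are illustrative assumptions, not a schema from the paper.

```python
from dataclasses import dataclass

@dataclass
class PreferenceRecord:
    """One pairwise comparison from a human labeler (hypothetical fields)."""
    prompt: str    # the request or instruction given to the model
    chosen: str    # the completion the labeler preferred
    rejected: str  # the completion the labeler did not prefer

# Example of the kind of record a reward model is trained on.
example = PreferenceRecord(
    prompt="Summarize the plot of Hamlet in two sentences.",
    chosen="Prince Hamlet seeks revenge after learning his uncle murdered his father...",
    rejected="Hamlet is a play. It has characters and things happen.",
)
```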
The reward models trained for RLHF are most often trained as classifiers between a chosen and a rejected completion to a prompt; during RL optimization they return a scalar value for each piece of text. Given two candidate completions, y_chosen and y_rejected, for a prompt x, each assigned a scalar score r by an initially untrained value head on an LLM (or by a separate value model), the loss for the reward model follows (Askell et al., 2021; Ouyang et al., 2022):

$$\mathcal{L}(\theta) = -\mathbb{E}_{(x,\, y_{\text{chosen}},\, y_{\text{rejected}})}\left[\log \sigma\big(r_\theta(x, y_{\text{chosen}}) - r_\theta(x, y_{\text{rejected}})\big)\right]$$
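A minimal PyTorch sketch of this pairwise objective, assuming the reward model already maps (prompt, completion) pairs to scalar scores; the function and variable names are placeholders, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor,
                         rejected_scores: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r(x, y_chosen) - r(x, y_rejected)), averaged over the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy scalar scores from the value head for a batch of three comparisons.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
loss = pairwise_reward_loss(chosen, rejected)  # smaller when chosen outscores rejected
```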
With LLMs, the generating model is referred to as the policy model. In RLHF, the discount factor on reward is set to 1 and no further actions are taken for the given prompt, casting the problem as a contextual bandit. An example of the RLHF loop is shown in Fig. 2b, in comparison to a standard RL loop shown in Fig. 2a.
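As a rough illustration of the contextual-bandit structure (one completion per prompt, discount factor of 1), the sketch below runs a single-step, REINFORCE-style update with toy stand-ins for the policy and reward model; it is not the authors' training setup or a full RLHF algorithm.

```python
import torch

# Toy stand-ins (assumptions for illustration): a categorical "policy" over a tiny
# set of candidate completions, and a reward model that prefers completion 3.
vocab_size = 8
policy_logits = torch.zeros(vocab_size, requires_grad=True)
reward_model = lambda completion: torch.tensor(1.0 if completion.item() == 3 else 0.0)

optimizer = torch.optim.Adam([policy_logits], lr=0.1)

for step in range(100):
    dist = torch.distributions.Categorical(logits=policy_logits)
    completion = dist.sample()          # one "completion" per prompt: the episode ends here
    reward = reward_model(completion)   # scalar score; with discount factor 1 this is the full return
    loss = -reward * dist.log_prob(completion)  # REINFORCE-style policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```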
This paper is available on arxiv under CC 4.0 license.