Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to corbyrosset@microsoft.com;
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to hassanam@microsoft.com;
(6) Tengyang Xie, Microsoft Research and Correspondence to tengyangxie@microsoft.com.
2 Preliminaries
This section provides an overview of the RL from human feedback (RLHF) pipeline. We do not differentiate between RLHF and RLAIF (e.g., Bai et al., 2022b; Lee et al., 2023), as the distinction is outside the scope of our discussion; we will uniformly refer to both as RLHF. However, we do want to draw a clear delineation between two subtly different settings: RLHF that maximizes point-wise reward functions, and RLHF that optimizes general preferences. It should be noted that this discussion applies more broadly to the general contextual bandit setup as well.
Throughout this paper, we use x ∈ X to denote the input (i.e., the prompt) received by the LLM from a space X. Following the standard contextual bandit setup of RLHF (e.g., Ouyang et al., 2022; Rafailov et al., 2023), we do not consider distribution shift over the prompts, and we use ρ to denote the distribution of prompts. We use y ∈ Y to denote the response from the LLM given the prompt x (this corresponds to the action in the contextual bandit setup). We also use π : X → ∆(Y) to denote the policy, which here is an LLM, and Π to denote the policy class.
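To make this notation concrete, the following Python sketch (for illustration only; the names Prompt, Response, Policy, PreferenceOracle, and sample_comparison are our placeholders, not part of the paper) spells out the contextual-bandit interface: a prompt x drawn from ρ, responses sampled from policies, and a preference query P(y ≻ y′ | x).

```python
import random
from typing import Callable, Dict, List

# Illustrative types for the contextual-bandit view of RLHF:
#   x in X -> a prompt; y in Y -> a response; pi : X -> Delta(Y) -> a policy.
Prompt = str
Response = str
# A policy is represented here (for illustration) simply as a sampler y ~ pi(.|x).
Policy = Callable[[Prompt], Response]
# A general preference oracle returning P(y > y' | x) in [0, 1].
PreferenceOracle = Callable[[Prompt, Response, Response], float]


def sample_comparison(rho: List[Prompt], pi: Policy, pi_prime: Policy,
                      pref: PreferenceOracle) -> Dict[str, object]:
    """Draw x ~ rho, y ~ pi(.|x), y' ~ pi'(.|x), and query P(y > y' | x)."""
    x = random.choice(rho)       # x ~ rho (prompt distribution)
    y = pi(x)                    # y ~ pi(. | x)
    y_prime = pi_prime(x)        # y' ~ pi'(. | x)
    p = pref(x, y, y_prime)      # P(y > y' | x) in [0, 1]
    return {"prompt": x, "y": y, "y_prime": y_prime, "p_win": p}
```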
Our discussion throughout this paper will also regularly involve the following three learning paradigms, which were originally introduced in, and are commonly used throughout, the RL literature:
(1) Offline: The learning algorithm operates without any active data collection (e.g., sampling from the current policy); it relies solely on an offline dataset for training.
(2) Purely on-policy (technically, online on-policy): In this setup, learning proceeds by sampling outputs from the latest policy and immediately updating it based on the newly collected data. No data reuse or additional offline data is considered.
(3) Batched on-policy[3]: This setup sits between the offline and purely on-policy setups, striking a balance between deployment efficiency and adaptability. It involves iterative online data collection and can also use other offline data. Its distinctive feature is that data collection in each iteration occurs in a batched fashion (e.g., at the scale of a dataset, much larger than a typical mini-batch), and the policy can change more substantially between iterations (e.g., by running gradient steps over multiple epochs of a dataset, as opposed to tens of updates); a rough sketch of this regime is given below.
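The sketch below illustrates the batched on-policy regime (3) only at the level of control flow; generate_batch, annotate_preferences, and train_epochs are hypothetical helpers standing in for dataset-scale sampling, preference annotation, and multi-epoch training, and are not APIs defined in this paper.

```python
from typing import Callable, Iterable, List

# Hypothetical helper signatures (assumptions, not from the paper):
#   generate_batch(policy, prompts, k)      -> k sampled responses per prompt
#   annotate_preferences(samples)           -> preference-labeled pairs
#   train_epochs(policy, data, num_epochs)  -> the updated policy

def batched_on_policy_loop(policy,
                           prompts: List[str],
                           generate_batch: Callable,
                           annotate_preferences: Callable,
                           train_epochs: Callable,
                           num_iterations: int = 3,
                           num_epochs: int = 2,
                           offline_data: Iterable = ()):
    """Sketch of batched on-policy learning: dataset-scale collection from the
    current policy, followed by substantial policy updates, repeated T times."""
    for t in range(num_iterations):
        # Batched data collection from the *current* policy (dataset scale,
        # much larger than a mini-batch).
        samples = generate_batch(policy, prompts, k=4)
        labeled = annotate_preferences(samples)
        # Optionally mix in data from other policies (e.g., a teacher).
        training_data = list(labeled) + list(offline_data)
        # The policy change per iteration can be large: multiple epochs of updates.
        policy = train_epochs(policy, training_data, num_epochs=num_epochs)
    return policy
```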
2.1 RLHF Based on Reward Models
Reward-based RLHF assumes that pairwise preferences are governed by an underlying reward function r : X × Y → ℝ through the Bradley-Terry model (Bradley and Terry, 1952), P(y ≻ y′ | x) = σ(r(x, y) − r(x, y′)), where σ is the logistic function. This leads to the maximum-likelihood reward learning objective:

argmax_r E_{(x, y_w, y_l) ∼ D} [ log σ( r(x, y_w) − r(x, y_l) ) ],

where D is a dataset of preference triples (x, y_w, y_l) in which the response y_w is preferred over y_l.
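A minimal sketch of this objective as a training loss is shown below, assuming the Bradley-Terry parameterization above; reward_chosen and reward_rejected are placeholder tensors of reward-model scores, not code from the paper.

```python
import torch
import torch.nn.functional as F

def reward_mle_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferences under the Bradley-Terry model:
    -E[ log sigmoid( r(x, y_w) - r(x, y_l) ) ]."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example with placeholder reward scores r(x, y_w) and r(x, y_l):
r_w = torch.tensor([1.2, 0.3, 2.0])   # rewards of preferred responses
r_l = torch.tensor([0.4, 0.5, 1.1])   # rewards of dispreferred responses
loss = reward_mle_loss(r_w, r_l)      # minimized w.r.t. reward-model parameters
```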
2.2 RLHF with General Preferences
We now introduce the setup for directly optimizing a general preference function, as well as provide an overview of existing solutions to achieve this goal (mostly by leveraging the symmetry of the preferences), especially those proposed by Munos et al. (2023); Swamy et al. (2024).
Here we assume that the learner is given query access to a general preference function P(y ≻ y′ | x) ∈ [0, 1], for any (x, y, y′) ∈ X × Y × Y. This function indicates the probability that action y is preferred over y′ given the context x. In practice, this setup can be viewed as the theoretical model of RLAIF (e.g., Bai et al., 2022b; Yuan et al., 2024), human-in-the-loop RLHF (e.g., Ouyang et al., 2022), or distillation fine-tuning (e.g., Tunstall et al., 2023).
One common difficulty in optimizing a general preference function is its potential intransitivity: e.g., it is possible that P(a ≻ b) = P(b ≻ c) = P(c ≻ a) = 1 for some options (a, b, c) (for details, see, e.g., Bertrand et al., 2023; Munos et al., 2023; Swamy et al., 2024). Therefore, a natural learning goal when optimizing general preferences is the Nash equilibrium of the two-player zero-sum game whose payoff is the general preference function P. Such a Nash equilibrium is formally defined as the Minimax Winner, MW (see, e.g., Kreweras, 1965; Simpson, 1969; Kramer, 1973; Fishburn, 1984), or the von Neumann Winner (see, e.g., Dudík et al., 2015):

(π⋆, π⋆) = MW(P) := argmax_{π ∈ Π} min_{π′ ∈ Π} P(π ≻ π′), where P(π ≻ π′) := E_{x ∼ ρ, y ∼ π(·|x), y′ ∼ π′(·|x)} [ P(y ≻ y′ | x) ].
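To make the intransitivity and the Minimax Winner concrete, here is a small numerical illustration (ours, not from the paper) using the classic rock-paper-scissors-style preference over three responses; the uniform mixed policy attains the minimax value 1/2.

```python
import numpy as np

# Intransitive preference over three responses (a, b, c) for a fixed prompt x:
# P(a > b) = P(b > c) = P(c > a) = 1, with P(y > y) = 0.5 by convention.
P = np.array([
    [0.5, 1.0, 0.0],   # a vs (a, b, c)
    [0.0, 0.5, 1.0],   # b vs (a, b, c)
    [1.0, 0.0, 0.5],   # c vs (a, b, c)
])

# Every deterministic policy is beaten by some other response with probability 1,
# so no single response is "best". The minimax winner is a mixed policy.
pi_uniform = np.ones(3) / 3.0

# Expected win rate of the uniform policy against each pure response:
against_pure = pi_uniform @ P          # -> [0.5, 0.5, 0.5]
# Hence min_{pi'} P(pi > pi') = 0.5, the best achievable value, and the
# uniform policy is the minimax (von Neumann) winner in this example.
print(against_pure, against_pure.min())
```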
Nash-MD. Munos et al. (2023) proposed Nash-MD to approximate the Nash equilibrium of a KL-regularized preference function,

P_τ(π ≻ π′) := P(π ≻ π′) − τ·KL(π, μ) + τ·KL(π′, μ),

where μ is a reference policy, τ > 0 is the regularization coefficient, and KL(π, μ) := E_{x ∼ ρ} [ KL( π(·|x) ∥ μ(·|x) ) ].
Following this, Munos et al. (2023) demonstrate that the Nash equilibrium of MW(P_τ) can be approximated using a mirror-descent (Nemirovskij and Yudin, 1983; Bubeck, 2015; Lattimore and Szepesvári, 2020) inspired algorithm, Nash-MD, which enjoys a last-iterate convergence guarantee. The Nash-MD algorithm can be viewed as a two-step iterative process: for each t = 1, 2, . . . , T,

(i) compute the smoothed (geometric-mixture) policy π_t^μ(y | x) ∝ π_t(y | x)^{1 − ητ} μ(y | x)^{ητ};

(ii) update π_{t+1}(·|x) ← argmax_{π} η E_{y ∼ π(·|x)} [ P(y ≻ π_t^μ | x) ] − KL( π(·|x) ∥ π_t^μ(·|x) ),

where η is the learning rate and P(y ≻ π_t^μ | x) := E_{y′ ∼ π_t^μ(·|x)} [ P(y ≻ y′ | x) ].
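As a rough tabular sketch of these two steps (our illustration under the mirror-descent reading above, not the implementation of Munos et al. (2023); in particular, the exponential-weights form of step (ii) is the closed-form solution of the KL-regularized update in the discrete case), Nash-MD can be simulated on the small intransitive preference matrix from the earlier example.

```python
import numpy as np

def nash_md_tabular(P: np.ndarray, mu: np.ndarray, eta: float = 0.5,
                    tau: float = 0.1, T: int = 200) -> np.ndarray:
    """Tabular sketch of Nash-MD for a single prompt with |Y| responses.
    P[i, j] plays the role of P(y_i > y_j | x); mu is the reference policy."""
    pi = mu.copy()
    for _ in range(T):
        # Step (i): geometric mixture (smoothed policy) between pi_t and mu.
        mix = pi ** (1.0 - eta * tau) * mu ** (eta * tau)
        mix /= mix.sum()
        # Step (ii): mirror-descent update against the smoothed policy,
        # using P(y > pi_t^mu | x) = sum_j mix[j] * P[y, j].
        win_rate = P @ mix
        pi = mix * np.exp(eta * win_rate)
        pi /= pi.sum()
    return pi

# Reusing the intransitive matrix from the earlier example, the iterates should
# approach a near-uniform policy (slightly biased toward mu by the regularization):
P = np.array([[0.5, 1.0, 0.0], [0.0, 0.5, 1.0], [1.0, 0.0, 0.5]])
mu = np.array([0.6, 0.3, 0.1])            # some reference policy
print(nash_md_tabular(P, mu))
```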
This paper is available on arXiv under a CC BY 4.0 DEED license.
[3] We acknowledge an abuse of terminology. Our algorithm is not entirely online, as it only performs batched data collection. It is also not strictly on-policy, because it uses examples from other policies, such as a teacher. While “offline” or “off-policy” may be technically more accurate, those terms might mislead readers and detract from the emphasis we want to place on the samples collected from the current policy, which constitute the majority of our training data.