Authors:
(1) Corby Rosset, Microsoft Research and Correspondence to corbyrosset@microsoft.com;
(2) Ching-An Cheng, Microsoft Research;
(3) Arindam Mitra, Microsoft Research;
(4) Michael Santacroce, Microsoft Research;
(5) Ahmed Awadallah, Microsoft Research and Correspondence to hassanam@microsoft.com;
(6) Tengyang Xie, Microsoft Research and Correspondence to tengyangxie@microsoft.com. Table of Links Abstract and 1 Introduction 2 Preliminaries 2.1 RLHF Based on Reward Models 2.2 RLHF with General Preferences 3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1 3.2 Theoretical Analysis 4 Practical Algorithm – Iterative Contrastive Self-Improvement 5 Experiments and 5.1 Experimental Setup 5.2 Results and Analysis 6 Related Work 7 Conclusion and References Appendix A Extension to Regularized Preferences B Detailed Proofs C Additional Experimental Details 2 Preliminaries This section provides an overview of the RL from human feedback (RLHF) pipeline. We do not differentiate between RLHF and RLAIF (e.g., Bai et al., 2022b; Lee et al., 2023), as the distinction is outside our scope of discussion. Thus, we will uniformly refer to both concepts as RLHF. However, we want to make a clear delineation between two subtle differences: RLHF maximizing point-wise reward functions, and RLHF optimizing general preferences. It should be noted that this discussion is more broadly applicable in scope to general contextual bandits setup as well. Throughout this paper, we use x ∈ X to denote the input (i.e. the prompt) received by the LLM from a space X . In this paper, we do not consider the distribution shift over the prompts, following the standard contextual bandits setup of RLHF (e.g., Ouyang et al., 2022; Rafailov et al., 2023), and we use ρ to denote the distribution of the prompts. We use y ∈ Y to denote the response from the LLM given the prompt x (this corresponds to action in the contextual bandits setup). We also use π : X → ∆(Y) to denote the policy, which is a LLM here, and Π is the policy class. Our discussion throughout this paper will also regularly involve the following three learning paradigms, which are originally introduced and commonly used in the RL literature: (1) Offline: The learning algorithm operates without any active data collection, e.g., sampling from the current policy. The algorithm relies solely on an offline dataset for training. (2) Purely on-policy: technically, online on-policy. In this setup, learning takes place by sampling outputs from the latest policy and immediately updating it based on the newly collected data. No data reuse or additional offline data is considered. (3) Batched on-policy[3]: This setup is the middle of the offline and purely on-policy setups, striking a balance between deployment efficiency and adaptability. It involves iterative online data collection and can use other offline data. Its distinctive feature is that here the data collection in each iteration occurs in a batched fashion (e.g., akin to a dataset scale, much larger than the size of a typical mini-batch), and the amount of policy change can be more significant (e.g., running gradient steps over multiple epochs of a dataset, as opposed to tens of updates). 2.1 RLHF Based on Reward Models This leads to the maximum-likelihood reward learning objective: 2.2 RLHF with General Preferences We now introduce the setup for directly optimizing a general preference function, as well as provide an overview of existing solutions to achieve this goal (mostly by leveraging the symmetry of the preferences), especially those proposed by Munos et al. (2023); Swamy et al. (2024). Here we assume that the learner is given query access to a general preference function P(y ≻ y ′ | x) ∈ [0, 1], for any (x, y, y′ ) ∈ X × Y × Y. This function indicates the probability that action y is preferred over y ′ given the context x. In practice, this setup can be viewed as the theoretical mode of RLAIF (e.g., Bai et al., 2022b; Yuan et al., 2024), human-in-the-loop RLHF (e.g., Ouyang et al., 2022), or distillation fine-tuning (e.g., Tunstall et al., 2023). One common difficulty in optimizing a general preference function is its intransitivity, e.g., it is possible that P(a ≻ b) = P(b ≻ c) = P(c ≻ a) = 1, for some options (a, b, c) (details see, e.g., Bertrand et al., 2023; Munos et al., 2023; Swamy et al., 2024). Therefore, the learning goal of optimizing general preferences can be the Nash equilibrium of the two-player zero-sum game with the payoffs as the general preference function P. The formal definition of such Nash equilibrium is defined by the Minimax Winner, MW (see, e.g., Kreweras, 1965; Simpson, 1969; Kramer, 1973; Fishburn, 1984), or the von Neumann Winner (see, e.g., Dudík et al., 2015), Nash-MD. Munos et al. (2023) proposed Nash-MD to approximate the Nash equilibrium of a KL-regularized preference function, Following this, Munos et al. (2023) demonstrate that the Nash Equilibrium of MW(Pτ ) can be approximated using a mirror descent (Nemirovskĳ and Yudin, 1983; Bubeck, 2015; Lattimore and Szepesvári, 2020) inspired algorithm, Nash-MD, which has a last-iteration guarantee. The Nash-MD algorithm can be viewed as a two-step iterative process: for each t = 1, 2, . . . , T, This paper is available on arxiv under CC BY 4.0 DEED license. [3] We acknowledge abuse of terminology. Our algorithm is not entirely online, as it only contains batched data collection. It is also not strictly on-policy because it uses examples from other policies, like a teacher. While “offline” or “off-policy” may be technically more relevant, they might lead to misunderstanding among readers and detract from the emphasis we want to place on the collection samples from the current policy, which constitute the majority of our training data. Authors: (1) Corby Rosset, Microsoft Research and Correspondence to corbyrosset@microsoft.com; (2) Ching-An Cheng, Microsoft Research; (3) Arindam Mitra, Microsoft Research; (4) Michael Santacroce, Microsoft Research; (5) Ahmed Awadallah, Microsoft Research and Correspondence to hassanam@microsoft.com; (6) Tengyang Xie, Microsoft Research and Correspondence to tengyangxie@microsoft.com. Authors: Authors: (1) Corby Rosset, Microsoft Research and Correspondence to corbyrosset@microsoft.com; (2) Ching-An Cheng, Microsoft Research; (3) Arindam Mitra, Microsoft Research; (4) Michael Santacroce, Microsoft Research; (5) Ahmed Awadallah, Microsoft Research and Correspondence to hassanam@microsoft.com; (6) Tengyang Xie, Microsoft Research and Correspondence to tengyangxie@microsoft.com. Table of Links Abstract and 1 Introduction Abstract and 1 Introduction 2 Preliminaries 2 Preliminaries 2.1 RLHF Based on Reward Models 2.1 RLHF Based on Reward Models 2.2 RLHF with General Preferences 2.2 RLHF with General Preferences 3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1 3 Direct Nash Optimization and 3.1 Derivation of Algorithm 1 3.2 Theoretical Analysis 3.2 Theoretical Analysis 4 Practical Algorithm – Iterative Contrastive Self-Improvement 4 Practical Algorithm – Iterative Contrastive Self-Improvement 5 Experiments and 5.1 Experimental Setup 5 Experiments and 5.1 Experimental Setup 5.2 Results and Analysis 5.2 Results and Analysis 6 Related Work 6 Related Work 7 Conclusion and References 7 Conclusion and References Appendix Appendix A Extension to Regularized Preferences A Extension to Regularized Preferences B Detailed Proofs B Detailed Proofs C Additional Experimental Details C Additional Experimental Details 2 Preliminaries This section provides an overview of the RL from human feedback (RLHF) pipeline. We do not differentiate between RLHF and RLAIF (e.g., Bai et al., 2022b; Lee et al., 2023), as the distinction is outside our scope of discussion. Thus, we will uniformly refer to both concepts as RLHF. However, we want to make a clear delineation between two subtle differences: RLHF maximizing point-wise reward functions, and RLHF optimizing general preferences. It should be noted that this discussion is more broadly applicable in scope to general contextual bandits setup as well. Throughout this paper, we use x ∈ X to denote the input (i.e. the prompt) received by the LLM from a space X . In this paper, we do not consider the distribution shift over the prompts, following the standard contextual bandits setup of RLHF (e.g., Ouyang et al., 2022; Rafailov et al., 2023), and we use ρ to denote the distribution of the prompts. We use y ∈ Y to denote the response from the LLM given the prompt x (this corresponds to action in the contextual bandits setup). We also use π : X → ∆(Y) to denote the policy, which is a LLM here, and Π is the policy class. Our discussion throughout this paper will also regularly involve the following three learning paradigms, which are originally introduced and commonly used in the RL literature: (1) Offline: The learning algorithm operates without any active data collection, e.g., sampling from the current policy. The algorithm relies solely on an offline dataset for training. (2) Purely on-policy: technically, online on-policy. In this setup, learning takes place by sampling outputs from the latest policy and immediately updating it based on the newly collected data. No data reuse or additional offline data is considered. (3) Batched on-policy[3]: This setup is the middle of the offline and purely on-policy setups, striking a balance between deployment efficiency and adaptability. It involves iterative online data collection and can use other offline data. Its distinctive feature is that here the data collection in each iteration occurs in a batched fashion (e.g., akin to a dataset scale, much larger than the size of a typical mini-batch), and the amount of policy change can be more significant (e.g., running gradient steps over multiple epochs of a dataset, as opposed to tens of updates). 2.1 RLHF Based on Reward Models This leads to the maximum-likelihood reward learning objective: 2.2 RLHF with General Preferences We now introduce the setup for directly optimizing a general preference function, as well as provide an overview of existing solutions to achieve this goal (mostly by leveraging the symmetry of the preferences), especially those proposed by Munos et al. (2023); Swamy et al. (2024). Here we assume that the learner is given query access to a general preference function P(y ≻ y ′ | x) ∈ [0, 1], for any (x, y, y′ ) ∈ X × Y × Y. This function indicates the probability that action y is preferred over y ′ given the context x. In practice, this setup can be viewed as the theoretical mode of RLAIF (e.g., Bai et al., 2022b; Yuan et al., 2024), human-in-the-loop RLHF (e.g., Ouyang et al., 2022), or distillation fine-tuning (e.g., Tunstall et al., 2023). One common difficulty in optimizing a general preference function is its intransitivity, e.g., it is possible that P(a ≻ b) = P(b ≻ c) = P(c ≻ a) = 1, for some options (a, b, c) (details see, e.g., Bertrand et al., 2023; Munos et al., 2023; Swamy et al., 2024). Therefore, the learning goal of optimizing general preferences can be the Nash equilibrium of the two-player zero-sum game with the payoffs as the general preference function P. The formal definition of such Nash equilibrium is defined by the Minimax Winner, MW (see, e.g., Kreweras, 1965; Simpson, 1969; Kramer, 1973; Fishburn, 1984), or the von Neumann Winner (see, e.g., Dudík et al., 2015), Nash-MD. Munos et al. (2023) proposed Nash-MD to approximate the Nash equilibrium of a KL-regularized preference function, Following this, Munos et al. (2023) demonstrate that the Nash Equilibrium of MW(Pτ ) can be approximated using a mirror descent (Nemirovskĳ and Yudin, 1983; Bubeck, 2015; Lattimore and Szepesvári, 2020) inspired algorithm, Nash-MD, which has a last-iteration guarantee. The Nash-MD algorithm can be viewed as a two-step iterative process: for each t = 1, 2, . . . , T, This paper is available on arxiv under CC BY 4.0 DEED license. This paper is available on arxiv under CC BY 4.0 DEED license. available on arxiv [3] We acknowledge abuse of terminology. Our algorithm is not entirely online, as it only contains batched data collection. It is also not strictly on-policy because it uses examples from other policies, like a teacher. While “offline” or “off-policy” may be technically more relevant, they might lead to misunderstanding among readers and detract from the emphasis we want to place on the collection samples from the current policy, which constitute the majority of our training data.

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Microsoft

How Direct Nash Optimization Improves AI Model Training 

The Best Way to Train AI: Reward Models vs Preference Optimization

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

256K Tokens on One GPU? Jamba’s Engineering Magic Explained

The Art of Arguing With Yourself—And Why It’s Making AI Smarter

Direct Nash Optimization Beats Bigger Models with Better Data

How Direct Nash Optimization Improves AI Model Training

How Contrastive Learning Helps AI Self-Improve

AI That Trains Itself? Here's How it Works

256K Tokens on One GPU? Jamba’s Engineering Magic Explained

The Art of Arguing With Yourself—And Why It’s Making AI Smarter

Direct Nash Optimization Beats Bigger Models with Better Data

How Direct Nash Optimization Improves AI Model Training

How Contrastive Learning Helps AI Self-Improve

AI That Trains Itself? Here's How it Works

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps