Authors:
(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);
(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);
(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.

Table of Links

Abstract and 1. Introduction
2 Related Work
3 Preliminaries
4 Direct Preference Optimization
5 Theoretical Analysis of DPO
6 Experiments
7 Discussion, Acknowledgements, and References
Author Contributions
A Mathematical Derivations
A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2
A.6 Proof of Theorem 1
B DPO Implementation Details and Hyperparameters
C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details
C.2 GPT-4 prompts for computing summarization and dialogue win rates
C.3 Unlikelihood baseline
D Additional Empirical Results
D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments
D.3 Human study details

2 Related Work

Self-supervised language models of increasing scale learn to complete some tasks zero-shot [31] or with few-shot prompts [6, 25, 11]. However, their performance on downstream tasks and alignment with user intent can be significantly improved by fine-tuning on datasets of instructions and human-written completions [23, 36, 13, 39]. This ‘instruction-tuning’ procedure enables LLMs to generalize to instructions outside of the instruction-tuning set and generally increases their usability [13]. Despite the success of instruction tuning, relative human judgments of response quality are often easier to collect than expert demonstrations, and thus subsequent works have fine-tuned LLMs with datasets of human preferences, improving proficiency in translation [18], summarization [38, 49], story-telling [49], and instruction-following [26, 32].
These methods first optimize a neural network reward function for compatibility with the dataset of preferences under a preference model such as the Bradley-Terry model [5], then fine-tune a language model to maximize the given reward using reinforcement learning algorithms, commonly REINFORCE [45], proximal policy optimization (PPO; [37]), or variants [32]. A closely related line of work leverages LLMs fine-tuned for instruction following with human feedback to generate additional synthetic preference data for targeted attributes such as safety or harmlessness [2], using only weak supervision from humans in the form of a text rubric for the LLM’s annotations. These methods represent a convergence of two bodies of work: one on training language models with reinforcement learning for a variety of objectives [33, 27, 46] and another on general methods for learning from human preferences [12, 19]. Despite the appeal of using relative human preferences, fine-tuning large language models with reinforcement learning remains a major practical challenge; this work provides a theoretically justified approach to optimizing relative preferences without RL.

Outside of the context of language, learning policies from preferences has been studied in both bandit and reinforcement learning settings, and several approaches have been proposed. Contextual bandit learning using preferences or rankings of actions, rather than rewards, is known as a contextual dueling bandit (CDB; [48, 14]). In the absence of absolute rewards, theoretical analysis of CDBs substitutes the notion of an optimal policy with a von Neumann winner, a policy whose expected win rate against any other policy is at least 50% [14]. However, in the CDB setting, preference labels are given online, while in learning from human preferences, we typically learn from a fixed batch of offline preference-annotated action pairs [47].
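To make the von Neumann winner notion above concrete, the following is a minimal sketch under simplifying assumptions that are not from the paper: a single context, a finite set of three actions, and a hand-written pairwise preference matrix. The names `win_rates`, `is_von_neumann_winner`, and `PREF` are hypothetical and chosen for illustration only. It checks whether a mixed strategy wins at least 50% of the time against every individual action, which by linearity also covers every mixture of actions.

```python
from typing import List, Sequence

def win_rates(pref: Sequence[Sequence[float]], mix: Sequence[float]) -> List[float]:
    """Expected win rate of the mixed strategy `mix` against each single action.

    pref[i][j] is the (assumed known) probability that action i is preferred to action j.
    """
    n = len(pref)
    return [sum(mix[i] * pref[i][j] for i in range(n)) for j in range(n)]

def is_von_neumann_winner(pref: Sequence[Sequence[float]],
                          mix: Sequence[float],
                          tol: float = 1e-9) -> bool:
    """True if `mix` wins at least 50% of the time against every action.

    A win rate against a mixture of actions is a weighted average of the
    per-action win rates, so checking single actions is sufficient.
    """
    return all(rate >= 0.5 - tol for rate in win_rates(pref, mix))

# Illustrative cyclic preferences (made-up numbers): 0 beats 1, 1 beats 2, 2 beats 0.
PREF = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]

uniform = [1 / 3, 1 / 3, 1 / 3]
always_first = [1.0, 0.0, 0.0]

print(is_von_neumann_winner(PREF, uniform))       # True
print(is_von_neumann_winner(PREF, always_first))  # False
```

With cyclic preferences no single action beats all others, yet the uniform mixture still achieves a 50% win rate against each of them. This is why a von Neumann winner can stand in for an optimal policy when only relative comparisons are available.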
Similarly, preference-based RL (PbRL) learns from binary preferences generated by an unknown ‘scoring’ function rather than rewards [9, 35]. Various algorithms for PbRL exist, including methods that can reuse off-policy preference data, but they generally involve first explicitly estimating the latent scoring function (i.e., the reward model) and subsequently optimizing it [16, 9, 12, 34, 19]. We instead present a single-stage policy learning approach that directly optimizes a policy to satisfy preferences.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.