Authors:
(1) Rafael Rafailo, Stanford University and Equal contribution; more junior authors listed earlier;
(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;
(3) Eric Mitchel, Stanford University and Equal contribution; more junior authors listed earlier;
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University. Table of Links Abstract and 1. Introduction 2 Related Work 3 Preliminaries 4 Direct Preference Optimization 5 Theoretical Analysis of DPO 6 Experiments 7 Discussion, Acknowledgements, and References Author Contributions A Mathematical Derivations A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective A.2 Deriving the DPO Objective Under the Bradley-Terry Model A.3 Deriving the DPO Objective Under the Plackett-Luce Model A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2 A.6 Proof of Theorem 1 B DPO Implementation Details and Hyperparameters C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details C.2 GPT-4 prompts for computing summarization and dialogue win rates C.3 Unlikelihood baseline D Additional Empirical Results D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments D.3 Human study details 5 Theoretical Analysis of DPO In this section, we give further interpretation of the DPO method, provide theoretical backing, and relate advantages of DPO to issues with actor critic algorithms used for RLHF (such as PPO [37]). 5.1 Your Language Model Is Secretly a Reward Model Definition 1. We say that two reward functions r(x, y) and r ′ (x, y) are equivalent iff r(x, y) − r ′ (x, y) = f(x) for some function f. It is easy to see that this is indeed an equivalence relation, which partitions the set of reward functions into classes. We can state the following two lemmas: Lemma 1. Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution. Lemma 2.** Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. The proofs are straightforward and we defer them to Appendix A.5. The first lemma is a well-known under-specification issue with the Plackett-Luce family of models [30]. Due to this under-specification, we usually have to impose additional identifiability constraints to achieve any guarantees on the MLE estimates from Eq. 2 [4]. The second lemma states that all reward functions from the same class yield the same optimal policy, hence for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class. We prove the following Theorem in Appendix A.6: 5.2 Instability of Actor-Critic Algorithms This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license. Authors: (1) Rafael Rafailo, Stanford University and Equal contribution; more junior authors listed earlier; (2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier; (3) Eric Mitchel, Stanford University and Equal contribution; more junior authors listed earlier; (4) Stefano Ermon, CZ Biohub; (5) Christopher D. Manning, Stanford University; (6) Chelsea Finn, Stanford University. Authors: Authors: (1) Rafael Rafailo, Stanford University and Equal contribution; more junior authors listed earlier; (2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier; (3) Eric Mitchel, Stanford University and Equal contribution; more junior authors listed earlier; (4) Stefano Ermon, CZ Biohub; (5) Christopher D. Manning, Stanford University; (6) Chelsea Finn, Stanford University. Table of Links Abstract and 1. Introduction Abstract and 1. Introduction 2 Related Work 2 Related Work 3 Preliminaries 3 Preliminaries 4 Direct Preference Optimization 4 Direct Preference Optimization 5 Theoretical Analysis of DPO 5 Theoretical Analysis of DPO 6 Experiments 6 Experiments 7 Discussion, Acknowledgements, and References 7 Discussion, Acknowledgements, and References Author Contributions Author Contributions A Mathematical Derivations A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective A.2 Deriving the DPO Objective Under the Bradley-Terry Model A.2 Deriving the DPO Objective Under the Bradley-Terry Model A.3 Deriving the DPO Objective Under the Plackett-Luce Model A.3 Deriving the DPO Objective Under the Plackett-Luce Model A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2 A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2 A.6 Proof of Theorem 1 A.6 Proof of Theorem 1 B DPO Implementation Details and Hyperparameters B DPO Implementation Details and Hyperparameters C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details C.2 GPT-4 prompts for computing summarization and dialogue win rates C.2 GPT-4 prompts for computing summarization and dialogue win rates C.3 Unlikelihood baseline C.3 Unlikelihood baseline D Additional Empirical Results D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments D.3 Human study details D.3 Human study details 5 Theoretical Analysis of DPO In this section, we give further interpretation of the DPO method, provide theoretical backing, and relate advantages of DPO to issues with actor critic algorithms used for RLHF (such as PPO [37]). 5.1 Your Language Model Is Secretly a Reward Model Definition 1. We say that two reward functions r(x, y) and r ′ (x, y) are equivalent iff r(x, y) − r ′ (x, y) = f(x) for some function f. Definition 1. We say that two reward functions r(x, y) and r ′ (x, y) are equivalent iff r(x, y) − r ′ (x, y) = f(x) for some function f. It is easy to see that this is indeed an equivalence relation, which partitions the set of reward functions into classes. We can state the following two lemmas: Lemma 1. Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution. Lemma 1. Under the Plackett-Luce, and in particular the Bradley-Terry, preference framework, two reward functions from the same class induce the same preference distribution. Lemma 2. ** Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. Lemma 2 . Lemma 2 Two reward functions from the same equivalence class induce the same optimal policy under the constrained RL problem. The proofs are straightforward and we defer them to Appendix A.5. The first lemma is a well-known under-specification issue with the Plackett-Luce family of models [30]. Due to this under-specification, we usually have to impose additional identifiability constraints to achieve any guarantees on the MLE estimates from Eq. 2 [4]. The second lemma states that all reward functions from the same class yield the same optimal policy, hence for our final objective, we are only interested in recovering an arbitrary reward function from the optimal class. We prove the following Theorem in Appendix A.6: 5.2 Instability of Actor-Critic Algorithms This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license. This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

Theoretical Analysis of Direct Preference Optimization

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

102 Languages, One Model: The Multimodal AI Breakthrough You Need to Know

Davinci Is Bad at Maths: Fine-Tuning ChatGPT Models With NodeJs and OpenAI v4

Fine-Tuning Mistral 7B: Enhance Open-Source Language Models with MindsDB and Anyscale Endpoints

Direct Preference Optimization (DPO): Simplifying AI Fine-Tuning for Human Preferences

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Simplifying AI Training: Direct Preference Optimization vs. Traditional RL

102 Languages, One Model: The Multimodal AI Breakthrough You Need to Know

Davinci Is Bad at Maths: Fine-Tuning ChatGPT Models With NodeJs and OpenAI v4

Fine-Tuning Mistral 7B: Enhance Open-Source Language Models with MindsDB and Anyscale Endpoints

Direct Preference Optimization (DPO): Simplifying AI Fine-Tuning for Human Preferences

Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Simplifying AI Training: Direct Preference Optimization vs. Traditional RL

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps