Authors:
(1) Rafael Rafailov, Stanford University and Equal contribution; more junior authors listed earlier;
(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;
(3) Eric Mitchell, Stanford University and Equal contribution; more junior authors listed earlier;
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.
4 Direct Preference Optimization
7 Discussion, Acknowledgements, and References
A Mathematical Derivations
A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2
B DPO Implementation Details and Hyperparameters
C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details
C.2 GPT-4 prompts for computing summarization and dialogue win rates
D Additional Empirical Results
D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments
We find that Best of N is a strong baseline in our experiments, although it is computationally expensive, since it requires sampling many completions per prompt. We include an evaluation of the Best of N baseline for various N on the Anthropic-HH dialogue and TL;DR summarization tasks; the results are shown in Figure 4.
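To make the procedure concrete, the following is a minimal sketch of a Best of N selection step. The function and parameter names (sample_completion, reward_fn, n) are illustrative placeholders, not identifiers from the paper's code: the sketch assumes access to a sampler for the SFT policy and a learned (proxy) reward model that scores a prompt-completion pair.

```python
# Minimal sketch of the Best of N baseline (names are illustrative, not from the paper's code).
# For each prompt, draw N candidate completions from the policy and keep the one that the
# reward model scores highest.

from typing import Callable, List


def best_of_n(
    prompt: str,
    sample_completion: Callable[[str], str],  # draws one completion from the SFT policy (assumed)
    reward_fn: Callable[[str, str], float],   # scores a (prompt, completion) pair (assumed)
    n: int = 64,
) -> str:
    """Return the highest-reward completion among n independent samples."""
    candidates: List[str] = [sample_completion(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: reward_fn(prompt, completion))
```

The cost of this baseline scales linearly with N at inference time, which is why it is strong but expensive relative to a single forward pass from a fine-tuned policy.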
In this section, we present example comparisons between DPO and the baseline (PPO with temperature 0 for summarization, and the ground-truth chosen response for dialogue). See Tables 4-6 for summarization examples and Tables 7-10 for dialogue examples.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.