Authors:
(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);
(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);
(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);
(4) Stefano Ermon, CZ Biohub;
(5) Christopher D. Manning, Stanford University;
(6) Chelsea Finn, Stanford University.
4 Direct Preference Optimization
7 Discussion, Acknowledgements, and References
A Mathematical Derivations
A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective
A.2 Deriving the DPO Objective Under the Bradley-Terry Model
A.3 Deriving the DPO Objective Under the Plackett-Luce Model
A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2
B DPO Implementation Details and Hyperparameters
C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details
C.2 GPT-4 prompts for computing summarization and dialogue win rates
D Additional Empirical Results
D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments
D.3 Human Study Details
In order to validate the use of GPT-4 for computing win rates, our human study collects human preference data for several matchups in the TL;DR summarization setting. We select three algorithmic matchups, evaluating DPO (temp. 0.25), SFT (temp. 0.25), and PPO (temp. 1.0; PPO-1), each compared against the reference algorithm PPO at temperature 0 (PPO-0). By selecting matchups covering three distinct algorithms as well as a wide range of win rates against the reference, we capture the similarity of human and GPT-4 win rates across the response quality spectrum. We sample 150 random comparisons of DPO vs. PPO-0 and 100 random comparisons of PPO-1 vs. PPO-0, assigning two humans to each comparison, producing 275 judgments for DPO-PPO[7] and 200 judgments for PPO-PPO. We sample 125 SFT comparisons, assigning a single human to each. We ignore judgments that humans labeled as ties (only about 1% of judgments) and measure the raw agreement percentage between human A and human B (for comparisons where we have two human annotators, i.e., not SFT), as well as between each human and GPT-4.
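To make the agreement calculation concrete, the sketch below shows one way to compute these raw agreement percentages, assuming each comparison's verdict is stored as the label of the winning algorithm, with ties marked as None so they can be dropped. The function name raw_agreement and the example judgment lists are hypothetical and purely illustrative; this is not the authors' analysis code.

```python
# Minimal sketch (assumed data layout, not the authors' code) of the raw agreement
# percentage described above: fraction of comparisons where two raters give the
# same verdict, excluding comparisons a rater labeled as a tie.
from typing import Optional


def raw_agreement(labels_a: list[Optional[str]], labels_b: list[Optional[str]]) -> float:
    """Raw agreement between two raters over paired comparisons.

    Each label is the winning algorithm for one comparison (e.g. "DPO" or "PPO-0");
    None marks a tie and is excluded from the calculation.
    """
    kept = [(a, b) for a, b in zip(labels_a, labels_b) if a is not None and b is not None]
    if not kept:
        return float("nan")
    return sum(a == b for a, b in kept) / len(kept)


# Hypothetical judgments for a handful of DPO vs. PPO-0 comparisons.
human_a = ["DPO", "DPO", "PPO-0", None, "DPO"]
human_b = ["DPO", "PPO-0", "PPO-0", "DPO", "DPO"]
gpt4 = ["DPO", "DPO", "DPO", "DPO", "DPO"]

print(f"human A vs. human B: {raw_agreement(human_a, human_b):.0%}")
print(f"human A vs. GPT-4:   {raw_agreement(human_a, gpt4):.0%}")
print(f"human B vs. GPT-4:   {raw_agreement(human_b, gpt4):.0%}")
```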
Participants. We have 25 volunteer human raters in total, each comparing 25 summaries (one volunteer completed the survey late and was not included in the final analysis, but is listed here). The raters were Stanford students (from undergraduate through Ph.D.) or recent Stanford graduates or visitors, with a STEM (mainly CS) focus. See Figure 5 for a screenshot of the survey interface. We gratefully acknowledge the contributions of each of our volunteers, listed in random order:
This paper is available on arXiv under a CC BY-NC-ND 4.0 DEED license.
[7] One volunteer did not respond for the DPO-PPO comparison.