The Unlikelihood Baseline in Sentiment Experiments

Written by textmodels | Published 2024/08/26
Tech Story Tags: ai-fine-tuning | direct-preference-optimization | reinforcement-learning | language-models | language-model-optimization | reward-modeling | bradley-terry-model | rhlf-explained

TLDR

The unlikelihood baseline is a simple approach that maximizes the probability of the preferred response while minimizing the probability of the dispreferred response. However, it often produces meaningless responses because the likelihood minimization is unconstrained. The article includes examples of prompts used to evaluate summarization and dialogue performance.

Authors:

(1) Rafael Rafailov, Stanford University and Equal contribution; more junior authors listed earlier;

(2) Archit Sharma, Stanford University and Equal contribution; more junior authors listed earlier;

(3) Eric Mitchell, Stanford University and Equal contribution; more junior authors listed earlier;

(4) Stefano Ermon, CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

Table of Links

Abstract and 1. Introduction

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

7 Discussion, Acknowledgements, and References

Author Contributions

A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1

B DPO Implementation Details and Hyperparameters

C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline

D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

C.3 Unlikelihood baseline
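
As a rough illustration of the baseline summarized in the TLDR, the sketch below writes the objective as a loss over sequence log-probabilities: it rewards a high log-probability for the preferred response and a low one for the dispreferred response. The function names, the plain-Python list inputs, and the `alpha` weight on the dispreferred term are assumptions made for this example, not code from the paper. It is a minimal sketch, not the authors' implementation.

```python
from typing import Sequence


def sequence_log_prob(token_log_probs: Sequence[float]) -> float:
    """Log-probability of a whole response: the sum of its per-token log-probabilities."""
    return sum(token_log_probs)


def unlikelihood_loss(
    logp_preferred: Sequence[float],
    logp_dispreferred: Sequence[float],
    alpha: float = 1.0,  # hypothetical weight on the dispreferred term
) -> float:
    """Batch-averaged loss: -log p(y_w | x) + alpha * log p(y_l | x).

    Minimizing it pushes the preferred log-probability up and the
    dispreferred log-probability down. Nothing bounds the second term.
    """
    if len(logp_preferred) != len(logp_dispreferred):
        raise ValueError("Both inputs must contain one value per preference pair.")
    if len(logp_preferred) == 0:
        raise ValueError("At least one preference pair is required.")

    total = 0.0
    for lw, ll in zip(logp_preferred, logp_dispreferred):
        total += -lw + alpha * ll
    return total / len(logp_preferred)


# Same preferred response, increasingly unlikely dispreferred response.
mild = unlikelihood_loss([-5.0], [-10.0])       # 5.0 - 10.0
extreme = unlikelihood_loss([-5.0], [-1000.0])  # 5.0 - 1000.0
print(mild, extreme)
```

In this toy setting the loss keeps falling as the dispreferred log-probability goes toward negative infinity, even though the preferred response is no more likely than before. That is one way to read the phrase "unconstrained likelihood minimization": the objective can be driven down without the model producing better responses.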

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.


Published by HackerNoon on 2024/08/26