GPT-4 Prompts for Computing Summarization and Dialogue Win Rates

by @textmodels



Too Long; Didn't Read

This article provides the GPT-4 prompts used to evaluate summarization and dialogue performance. The prompts instruct GPT-4 to compare two responses and determine which is more helpful or effective.

Authors:

(1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier);

(2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier);

(3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier);

(4) Stefano Ermon, CZ Biohub;

(5) Christopher D. Manning, Stanford University;

(6) Chelsea Finn, Stanford University.

Abstract and 1. Introduction

2 Related Work

3 Preliminaries

4 Direct Preference Optimization

5 Theoretical Analysis of DPO

6 Experiments

7 Discussion, Acknowledgements, and References

Author Contributions


A Mathematical Derivations

A.1 Deriving the Optimum of the KL-Constrained Reward Maximization Objective

A.2 Deriving the DPO Objective Under the Bradley-Terry Model

A.3 Deriving the DPO Objective Under the Plackett-Luce Model

A.4 Deriving the Gradient of the DPO Objective and A.5 Proof of Lemma 1 and 2

A.6 Proof of Theorem 1


B DPO Implementation Details and Hyperparameters


C Further Details on the Experimental Set-Up and C.1 IMDb Sentiment Experiment and Baseline Details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

C.3 Unlikelihood baseline


D Additional Empirical Results

D.1 Performance of Best of N baseline for Various N and D.2 Sample Responses and GPT-4 Judgments

D.3 Human study details

C.2 GPT-4 prompts for computing summarization and dialogue win rates

A key component of our experimental setup is GPT-4 win rate judgments. In this section, we include the prompts used to generate win rates for the summarization and dialogue experiments. We use gpt-4-0314 for all our experiments. The order of summaries or responses is randomly chosen for every evaluation.
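For concreteness, here is a minimal sketch of how such a judgment call might be issued with the openai Python package (v1 client). The model string comes from the paper; the helper name, randomization logic, and decoding settings are our own illustrative assumptions, not the authors' code.

```python
import random
from openai import OpenAI  # assumes the openai v1 Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gpt4_judge(prompt_template: str, post: str, test_summary: str,
               baseline_summary: str) -> tuple[str, bool]:
    """Fill the win-rate prompt and query the GPT-4 judge.

    Returns the raw judgment text and whether the A/B order was flipped
    (i.e., whether the test method was shown as Summary B).
    """
    # Randomize which method appears as Summary A, as in the paper,
    # so that position bias does not favor either side.
    flipped = random.random() < 0.5
    a, b = (baseline_summary, test_summary) if flipped \
        else (test_summary, baseline_summary)
    prompt = prompt_template.format(post=post, summary_a=a, summary_b=b)
    out = client.chat.completions.create(
        model="gpt-4-0314",  # the judge model used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,  # assumption: the paper does not state decoding settings
    )
    return out.choices[0].message.content, flipped
```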


Summarization GPT-4 win rate prompt (S).


Which of the following summaries does a better job of summarizing the most important points in the given forum post?


Post:

<post>


Summary A:

<Summary A>


Summary B:

<Summary B>


FIRST provide a one-sentence comparison of the two summaries, explaining which you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your choice. Your response should use the format:


Comparison: <one-sentence comparison and explanation>


Preferred: <"A" or "B">


Summarization GPT-4 win rate prompt (C).


Which of the following summaries does a better job of summarizing the most important points in the given forum post, without including unimportant or irrelevant details? A good summary is both precise and concise.


Post:

<post>


Summary A:

<Summary A>


Summary B:

<Summary B>


FIRST provide a one-sentence comparison of the two summaries, explaining which you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your choice. Your response should use the format:


Comparison: <one-sentence comparison and explanation>


Preferred: <"A" or "B">


Dialogue GPT-4 win rate prompt.


For the following query to a chatbot, which response is more helpful?


Query: <the user query>


Response A:

<either the test method or baseline>



Response B:

<the other response>


FIRST provide a one-sentence comparison of the two responses and explain which you feel is more helpful. SECOND, on a new line, state only "A" or "B" to indicate which response is more helpful. Your response should use the format:


Comparison: <one-sentence comparison and explanation>


More helpful: <"A" or "B">
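Since the A/B order is randomized for every comparison, computing a win rate means mapping each judged letter back to the method it named. A sketch continuing the assumptions above (for the dialogue prompt, the parsed key is "More helpful:" rather than "Preferred:"):

```python
def win_rate(results: list[tuple[str, bool]]) -> float:
    """results: (parsed choice "A"/"B", flipped) pairs per comparison,
    where `flipped` means the test method was shown second (as B)."""
    wins = sum(
        (choice == "A") != flipped  # judged letter maps back to the test method
        for choice, flipped in results
    )
    return wins / len(results)
```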


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.