Date: 2024-08-26

Authors: (1) Rafael Rafailov, Stanford University (equal contribution; more junior authors listed earlier); (2) Archit Sharma, Stanford University (equal contribution; more junior authors listed earlier); (3) Eric Mitchell, Stanford University (equal contribution; more junior authors listed earlier); (4) Stefano Ermon, CZ Biohub; (5) Christopher D. Manning, Stanford University; (6) Chelsea Finn, Stanford University.

A key component of our experimental setup is GPT-4 win rate judgments. In this section, we include the prompts used to generate win rates for the summarization and dialogue experiments. We use gpt-4-0314 for all our experiments. The order of the summaries or responses is randomly chosen for every evaluation.
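As a rough illustration of the randomized ordering and the resulting win rate, the sketch below places the test method's output in slot A or B at random and counts a win when the judge picks that slot. The function names are hypothetical, and the sketch ignores unparseable or missing judgments, whose handling is not specified here.

```python
import random


def assign_positions(test_output, baseline_output, rng):
    """Randomly place the test method's output in slot A or slot B.

    Returns (text_for_A, text_for_B, slot_of_test_output).
    """
    if rng.random() < 0.5:
        return test_output, baseline_output, "A"
    return baseline_output, test_output, "B"


def win_rate(judgments, test_slots):
    """Fraction of comparisons in which the judge chose the test method's slot.

    judgments:  list of "A"/"B" choices returned by the judge.
    test_slots: list of "A"/"B" slots the test method occupied.
    """
    if not judgments or len(judgments) != len(test_slots):
        raise ValueError("judgments and test_slots must be non-empty and aligned")
    wins = sum(j == s for j, s in zip(judgments, test_slots))
    return wins / len(judgments)
```

Keeping the slot assignment next to each judgment is what allows an "A" or "B" answer to be mapped back to the method that produced it.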





Summarization GPT-4 win rate prompt (S).





Which of the following summaries does a better job of summarizing the most \
important points in the given forum post?

Post:

<post>

Summary A:

<Summary A>

Summary B:

<Summary B>

FIRST provide a one-sentence comparison of the two summaries, explaining which \
you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your \
choice. Your response should use the format:

Comparison: <one-sentence comparison and explanation>

Preferred: <"A" or "B">





Summarization GPT-4 win rate prompt (C).





Which of the following summaries does a better job of summarizing the most \
important points in the given forum post, without including unimportant or \
irrelevant details? A good summary is both precise and concise.





Post:

<post>





Summary A:

<Summary A>





Summary B:

<Summary B>





FIRST provide a one-sentence comparison of the two summaries, explaining which \
you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your \
choice. Your response should use the format:





Comparison: <one-sentence comparison and explanation>





Preferred: <"A" or "B">
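A minimal sketch of how the concise (C) template might be filled in before it is sent to the judge model. The constant and function names are hypothetical, and the exact whitespace between fields is an assumption; the backslashes in the prompt listing are treated as line continuations, so they do not appear in the assembled string.

```python
# Template for the summarization (C) prompt; {post}, {summary_a} and
# {summary_b} stand in for the <post>, <Summary A> and <Summary B> slots.
SUMMARIZATION_PROMPT_C = (
    "Which of the following summaries does a better job of summarizing the most "
    "important points in the given forum post, without including unimportant or "
    "irrelevant details? A good summary is both precise and concise.\n\n"
    "Post:\n{post}\n\n"
    "Summary A:\n{summary_a}\n\n"
    "Summary B:\n{summary_b}\n\n"
    "FIRST provide a one-sentence comparison of the two summaries, explaining which "
    'you prefer and why. SECOND, on a new line, state only "A" or "B" to indicate your '
    "choice. Your response should use the format:\n"
    "Comparison: <one-sentence comparison and explanation>\n"
    'Preferred: <"A" or "B">'
)


def build_prompt(post, summary_a, summary_b):
    """Fill the three slots of the template with the post and both summaries."""
    return SUMMARIZATION_PROMPT_C.format(
        post=post, summary_a=summary_a, summary_b=summary_b
    )
```

Because `str.format` only interprets braces in the template itself, posts or summaries that contain `{` or `}` are inserted unchanged.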





Dialogue GPT-4 win rate prompt.





For the following query to a chatbot, which response is more helpful?





Query: <the user query>





Response A:

<either the test method or baseline>









Response B:

<the other response>





FIRST provide a one-sentence comparison of the two responses and explain \
which you feel is more helpful. SECOND, on a new line, state only "A" or \
"B" to indicate which response is more helpful. Your response should use \
the format:





Comparison: <one-sentence comparison and explanation>





More helpful: <"A" or "B">
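The response formats above end with a line that names the chosen option, so the judgment can be read back with a small parser. The following is only a sketch under that assumption: `parse_choice` is a hypothetical helper, and returning `None` for responses that do not follow the format is a design choice of this example, not something stated in the text.

```python
import re

# Matches a final-choice line such as `Preferred: A` or `More helpful: "B"`.
# Straight and curly quotes around the letter are both tolerated.
_CHOICE_RE = re.compile(
    r'^\s*(?:Preferred|More helpful):\s*["\u201c\u201d]?([AB])["\u201c\u201d]?\s*$',
    re.MULTILINE,
)


def parse_choice(response):
    """Return "A" or "B" from the judge's response, or None if no choice line is found."""
    matches = _CHOICE_RE.findall(response)
    # If the model repeats the label, keep the last occurrence.
    return matches[-1] if matches else None
```

For example, `parse_choice('Comparison: A is clearer.\nMore helpful: A')` yields `"A"`, while a response without a well-formed choice line yields `None` and can be excluded or retried.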





