paint-brush
How OPRO Elevates LLM Accuracy in GSM8K and BBH Benchmarksby@textmodels
148 reads

How OPRO Elevates LLM Accuracy in GSM8K and BBH Benchmarks

tldt arrow

Too Long; Didn't Read

OPRO enhances LLM performance in tasks like GSM8K and BBH by optimizing instructions using models like PaLM 2-L and GPT-4. Experiments reveal improved transferability of optimized instructions across mathematical reasoning datasets.
featured image - How OPRO Elevates LLM Accuracy in GSM8K and BBH Benchmarks
Writings, Papers and Blogs on Text Models HackerNoon profile picture

Authors:

(1) Chengrun Yang, Google DeepMind and Equal contribution;

(2) Xuezhi Wang, Google DeepMind;

(3) Yifeng Lu, Google DeepMind;

(4) Hanxiao Liu, Google DeepMind;

(5) Quoc V. Le, Google DeepMind;

(6) Denny Zhou, Google DeepMind;

(7) Xinyun Chen, Google DeepMind and Equal contribution.

Abstract and 1. Introduction

2 Opro: Llm as the Optimizer and 2.1 Desirables of Optimization by Llms

2.2 Meta-Prompt Design

3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression

3.2 Traveling Salesman Problem (TSP)

4 Application: Prompt Optimization and 4.1 Problem Setup

4.2 Meta-Prompt Design

5 Prompt Optimization Experiments and 5.1 Evaluation Setup

5.2 Main Results

5.3 Ablation Studies

5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with Evoprompt

6 Related Work

7 Conclusion, Acknowledgments and References

A Some Failure Cases

B Prompting Formats for Scorer Llm

C Meta-Prompts and C.1 Meta-Prompt for Math Optimization

C.2 Meta-Prompt for Prompt Optimization

D Prompt Optimization Curves on the Remaining Bbh Tasks

E Prompt Optimization on Bbh Tasks – Tabulated Accuracies and Found Instructions

5 PROMPT OPTIMIZATION EXPERIMENTS

We present the evaluation results for prompt optimization in this section. Our experiments demonstrate that OPRO brings a significant performance gain across the board, with different combinations of LLMs as the optimizer and the scorer.

5.1 EVALUATION SETUP

Models. The LLMs we use as the optimizer and the scorer are:


• Optimizer LLM: Pre-trained PaLM 2-L (Anil et al., 2023), instruction-tuned PaLM 2-L (denoted PaLM 2-L-IT), text-bison, gpt-3.5-turbo, and gpt-4.


• Scorer LLM: Pre-trained PaLM 2-L and text-bison.


With pre-trained PaLM 2-L as the scorer, the optimizer LLM generates A_begin instructions. Since text-bison has been instruction-tuned, the optimizer LLM generates Q_begin and Q_end instructions when text-bison is used as the scorer.


Benchmarks. Our primary evaluation benchmarks are GSM8K (Cobbe et al., 2021) and Big-Bench Hard (BBH) (Suzgun et al., 2022). GSM8K is a benchmark of grade school math word problems with 7,473 training samples and 1,319 test samples, where chain-of-thought prompting (Wei et al., 2022) and the zero-shot instruction “Let’s think step by step.” (Kojima et al., 2022) have drastically improved the performance over the standard prompting. BBH is a suite of 23 challenging BIG-Bench tasks (Srivastava et al., 2022) that covers a wide range of topics beyond arithmetic reasoning, including symbolic manipulation and commonsense reasoning. Each task contains up to 250 examples in total.


To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017).


Implementation details. We set the temperature to be 0 when evaluating the performance of generated instructions, in which case the scorer LLM greedily decodes. Unless otherwise specified, we set the default temperature to be 1.0 for optimizer LLMs to generate diverse and creative instructions. At each optimization step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions, then we add these instructions with their training scores to the optimization trajectory in the meta-prompt. Our meta-prompt at each step contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. We study the effect of different hyperparameters in ablation studies (Section 5.3). Appendix C.2 presents the full meta-prompts for different optimizer LLMs.


This paper is available on arxiv under CC0 1.0 DEED license.