Authors:
(1) Chengrun Yang, Google DeepMind and Equal contribution;
(2) Xuezhi Wang, Google DeepMind;
(3) Yifeng Lu, Google DeepMind;
(4) Hanxiao Liu, Google DeepMind;
(5) Quoc V. Le, Google DeepMind;
(6) Denny Zhou, Google DeepMind;
(7) Xinyun Chen, Google DeepMind and Equal contribution. Table of Links Abstract and 1. Introduction 2 Opro: Llm as the Optimizer and 2.1 Desirables of Optimization by Llms 2.2 Meta-Prompt Design 3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression 3.2 Traveling Salesman Problem (TSP) 4 Application: Prompt Optimization and 4.1 Problem Setup 4.2 Meta-Prompt Design 5 Prompt Optimization Experiments and 5.1 Evaluation Setup 5.2 Main Results 5.3 Ablation Studies 5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with Evoprompt 6 Related Work 7 Conclusion, Acknowledgments and References A Some Failure Cases B Prompting Formats for Scorer Llm C Meta-Prompts and C.1 Meta-Prompt for Math Optimization C.2 Meta-Prompt for Prompt Optimization D Prompt Optimization Curves on the Remaining Bbh Tasks E Prompt Optimization on Bbh Tasks – Tabulated Accuracies and Found Instructions 5 PROMPT OPTIMIZATION EXPERIMENTS We present the evaluation results for prompt optimization in this section. Our experiments demonstrate that OPRO brings a significant performance gain across the board, with different combinations of LLMs as the optimizer and the scorer. 5.1 EVALUATION SETUP Models. The LLMs we use as the optimizer and the scorer are: • Optimizer LLM: Pre-trained PaLM 2-L (Anil et al., 2023), instruction-tuned PaLM 2-L (denoted PaLM 2-L-IT), text-bison, gpt-3.5-turbo, and gpt-4. • Scorer LLM: Pre-trained PaLM 2-L and text-bison. With pre-trained PaLM 2-L as the scorer, the optimizer LLM generates A_begin instructions. Since text-bison has been instruction-tuned, the optimizer LLM generates Q_begin and Q_end instructions when text-bison is used as the scorer. Benchmarks. Our primary evaluation benchmarks are GSM8K (Cobbe et al., 2021) and Big-Bench Hard (BBH) (Suzgun et al., 2022). GSM8K is a benchmark of grade school math word problems with 7,473 training samples and 1,319 test samples, where chain-of-thought prompting (Wei et al., 2022) and the zero-shot instruction “Let’s think step by step.” (Kojima et al., 2022) have drastically improved the performance over the standard prompting. BBH is a suite of 23 challenging BIG-Bench tasks (Srivastava et al., 2022) that covers a wide range of topics beyond arithmetic reasoning, including symbolic manipulation and commonsense reasoning. Each task contains up to 250 examples in total. To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017). Implementation details. We set the temperature to be 0 when evaluating the performance of generated instructions, in which case the scorer LLM greedily decodes. Unless otherwise specified, we set the default temperature to be 1.0 for optimizer LLMs to generate diverse and creative instructions. At each optimization step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions, then we add these instructions with their training scores to the optimization trajectory in the meta-prompt. Our meta-prompt at each step contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. We study the effect of different hyperparameters in ablation studies (Section 5.3). Appendix C.2 presents the full meta-prompts for different optimizer LLMs. This paper is available on arxiv under CC0 1.0 DEED license. Authors: (1) Chengrun Yang, Google DeepMind and Equal contribution; (2) Xuezhi Wang, Google DeepMind; (3) Yifeng Lu, Google DeepMind; (4) Hanxiao Liu, Google DeepMind; (5) Quoc V. Le, Google DeepMind; (6) Denny Zhou, Google DeepMind; (7) Xinyun Chen, Google DeepMind and Equal contribution. Authors: Authors: (1) Chengrun Yang, Google DeepMind and Equal contribution; (2) Xuezhi Wang, Google DeepMind; (3) Yifeng Lu, Google DeepMind; (4) Hanxiao Liu, Google DeepMind; (5) Quoc V. Le, Google DeepMind; (6) Denny Zhou, Google DeepMind; (7) Xinyun Chen, Google DeepMind and Equal contribution. Table of Links Abstract and 1. Introduction Abstract and 1. Introduction 2 Opro: Llm as the Optimizer and 2.1 Desirables of Optimization by Llms 2 Opro: Llm as the Optimizer and 2.1 Desirables of Optimization by Llms 2.2 Meta-Prompt Design 2.2 Meta-Prompt Design 3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression 3 Motivating Example: Mathematical Optimization and 3.1 Linear Regression 3.2 Traveling Salesman Problem (TSP) 3.2 Traveling Salesman Problem (TSP) 4 Application: Prompt Optimization and 4.1 Problem Setup 4 Application: Prompt Optimization and 4.1 Problem Setup 4.2 Meta-Prompt Design 4.2 Meta-Prompt Design 5 Prompt Optimization Experiments and 5.1 Evaluation Setup 5 Prompt Optimization Experiments and 5.1 Evaluation Setup 5.2 Main Results 5.2 Main Results 5.3 Ablation Studies 5.3 Ablation Studies 5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with Evoprompt 5.4 Overfitting Analysis in Prompt Optimization and 5.5 Comparison with Evoprompt 6 Related Work 6 Related Work 7 Conclusion, Acknowledgments and References 7 Conclusion, Acknowledgments and References A Some Failure Cases A Some Failure Cases B Prompting Formats for Scorer Llm B Prompting Formats for Scorer Llm C Meta-Prompts and C.1 Meta-Prompt for Math Optimization C Meta-Prompts and C.1 Meta-Prompt for Math Optimization C.2 Meta-Prompt for Prompt Optimization C.2 Meta-Prompt for Prompt Optimization D Prompt Optimization Curves on the Remaining Bbh Tasks D Prompt Optimization Curves on the Remaining Bbh Tasks E Prompt Optimization on Bbh Tasks – Tabulated Accuracies and Found Instructions E Prompt Optimization on Bbh Tasks – Tabulated Accuracies and Found Instructions 5 PROMPT OPTIMIZATION EXPERIMENTS We present the evaluation results for prompt optimization in this section. Our experiments demonstrate that OPRO brings a significant performance gain across the board, with different combinations of LLMs as the optimizer and the scorer. 5.1 EVALUATION SETUP Models . The LLMs we use as the optimizer and the scorer are: Models • Optimizer LLM: Pre-trained PaLM 2-L (Anil et al., 2023), instruction-tuned PaLM 2-L (denoted PaLM 2-L-IT), text-bison, gpt-3.5-turbo, and gpt-4. • Scorer LLM: Pre-trained PaLM 2-L and text-bison. With pre-trained PaLM 2-L as the scorer, the optimizer LLM generates A_begin instructions. Since text-bison has been instruction-tuned, the optimizer LLM generates Q_begin and Q_end instructions when text-bison is used as the scorer. Benchmarks. Our primary evaluation benchmarks are GSM8K (Cobbe et al., 2021) and Big-Bench Hard (BBH) (Suzgun et al., 2022). GSM8K is a benchmark of grade school math word problems with 7,473 training samples and 1,319 test samples, where chain-of-thought prompting (Wei et al., 2022) and the zero-shot instruction “Let’s think step by step.” (Kojima et al., 2022) have drastically improved the performance over the standard prompting. BBH is a suite of 23 challenging BIG-Bench tasks (Srivastava et al., 2022) that covers a wide range of topics beyond arithmetic reasoning, including symbolic manipulation and commonsense reasoning. Each task contains up to 250 examples in total. Benchmarks. To examine the transferability of the optimized instructions, we also evaluate the instructions optimized for GSM8K on two other mathematical reasoning datasets, i.e., MultiArith (Roy & Roth, 2016) and AQuA (Ling et al., 2017). Implementation details. We set the temperature to be 0 when evaluating the performance of generated instructions, in which case the scorer LLM greedily decodes. Unless otherwise specified, we set the default temperature to be 1.0 for optimizer LLMs to generate diverse and creative instructions. At each optimization step, we prompt the optimizer LLM with the meta-prompt 8 times to generate 8 instructions, then we add these instructions with their training scores to the optimization trajectory in the meta-prompt. Our meta-prompt at each step contains the best 20 instructions so far and 3 randomly picked exemplars from the training set. We study the effect of different hyperparameters in ablation studies (Section 5.3). Appendix C.2 presents the full meta-prompts for different optimizer LLMs. This paper is available on arxiv under CC0 1.0 DEED license. This paper is available on arxiv under CC0 1.0 DEED license. available on arxiv

Part of HackerNoon's growing list of open-source research papers, promoting free access to academic material.

How OPRO Elevates LLM Accuracy in GSM8K and BBH Benchmarks

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

102 Languages, One Model: The Multimodal AI Breakthrough You Need to Know

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

102 Languages, One Model: The Multimodal AI Breakthrough You Need to Know

The Noonification: How Often Do NFTs Pass The Howey Test? (1/13/2023)

Darwin's Hybrid Intelligence to Align AI & Human Goals for Startups & VCs

The Noonification: White Man (11/26/2022)

The Noonification: The Metaverse is a Sh*tshow (11/2/2022)

100 Days of AI Day 1: From Newsletter to Podcast, Leveraging AI for Audio Transformation

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps