B Details of Think-and-Execute
C Prompts Used in Our Experiments
D Human-written Pseudocode Prompts
F Generated Pseudocode Prompts
We curate seven algorithmic reasoning tasks from Big-Bench Hard (Suzgun et al., 2022), including: dyck languages; geometric shapes; navigate; reasoning about colored objects; temporal sequences; tracking shuffled objects; web of lies. These are specifically designed to measure the step-by-step reasoning capability of LLMs. Model performance is evaluated in zero-shot settings, where we do not provide demonstrations in the prompt. We provide detailed explanations in Appendix A.4.
We consider the following baselines: (1) Direct prompting: Directly predicting the answer without generating any rationales. (2) Zero-shot CoT (Kojima et al., 2022): A setting where the LLM is prompted to generate reasoning steps with “Let’s think step by step” before producing the answer. (3) Zero-shot PoT (Chen et al., 2023): A setting where an LLM generates an instance-specific Python code that can be executed with a Python interpreter. Then, the execution result is used as the final answer. (4) NL planning: A variation of THINK-AND-EXECUTE, where the task-level instruction is generated in natural language, instead of pseudocode.
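To make the Zero-shot PoT baseline concrete, the sketch below illustrates its generate-then-execute flow: the model writes instance-specific Python code, the code is run with a Python interpreter, and the execution result is taken as the final answer. The `generate` callable, the prompt wording, and the convention of storing the result in an `ans` variable are assumptions for illustration only, not the exact implementation of Chen et al. (2023).

```python
def zero_shot_pot(question: str, generate) -> str:
    """Minimal sketch of Zero-shot PoT, assuming `generate(prompt)` queries an LLM.

    The prompt format and the `ans` variable convention are hypothetical choices
    made for this example.
    """
    prompt = (
        f"{question}\n"
        "# Write Python code that computes the answer and stores it in a variable `ans`."
    )
    code = generate(prompt)           # LLM-produced, instance-specific Python code
    namespace: dict = {}
    exec(code, namespace)             # execute the generated code with the interpreter
    return str(namespace.get("ans"))  # the execution result serves as the final answer
```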
For the Reasoner LM R, we adopt GPT-3.5-Turbo (OpenAI, 2023), which shows strong performance in various reasoning benchmarks and code generation tasks (Zellers et al., 2019; Cobbe et al., 2021; Muennighoff et al., 2024), as well as the 7B and 13B versions of CodeLlama (Roziere et al., 2023), which are trained on both code and natural language corpora and further fine-tuned to follow natural language instructions. As for the Instructor LM I, we choose GPT-3.5-Turbo.
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Hyungjoo Chae, Yonsei University;
(2) Yeonghyeon Kim, Yonsei University;
(3) Seungone Kim, KAIST AI;
(4) Kai Tzu-iunn Ong, Yonsei University;
(5) Beong-woo Kwak, Yonsei University;
(6) Moohyeon Kim, Yonsei University;
(7) Seonghwan Kim, Yonsei University;
(8) Taeyoon Kwon, Yonsei University;
(9) Jiwan Chung, Yonsei University;
(10) Youngjae Yu, Yonsei University;
(11) Jinyoung Yeo, Yonsei University.