Table of Links

Abstract and 1. Introduction
2 Think-and-Execute
3 Experimental Setup
4 Results
5 Analysis
6 Related Work
7 Limitations and Discussion
8 Conclusion and References
A Experimental Details
B Details of Think-and-Execute
C Prompts Used in Our Experiments
D Human-written Pseudocode Prompts
E Generated Analyses
F Generated Pseudocode Prompts
G Qualitative Analysis

3 Experimental Setup

3.1 Datasets

We curate seven algorithmic reasoning tasks from Big-Bench Hard (Suzgun et al., 2022): dyck languages; geometric shapes; navigate; reasoning about colored objects; temporal sequences; tracking shuffled objects; and web of lies. These tasks are specifically designed to measure the step-by-step reasoning capability of LLMs. Model performance is evaluated in a zero-shot setting, where we do not provide demonstrations in the prompt. We provide detailed explanations in Appendix A.4.

3.2 Baselines

We consider the following baselines (a minimal sketch of how these prompting baselines can be constructed is given at the end of this page):

(1) Direct prompting: Directly predicting the answer without generating any rationale.

(2) Zero-shot CoT (Kojima et al., 2022): A setting where LLMs are prompted to generate the reasoning steps with "Let's think step by step" before producing the answer.

(3) Zero-shot PoT (Chen et al., 2023): A setting where an LLM generates instance-specific Python code that can be executed with a Python interpreter. The execution result is then used as the final answer.

(4) NL planning: A variation of THINK-AND-EXECUTE, where the task-level instruction is generated in natural language instead of pseudocode.

3.3 Models

For the Reasoner LM R, we adopt GPT-3.5-Turbo (OpenAI, 2023), which shows strong performance on various reasoning benchmarks and code generation tasks (Zellers et al., 2019; Cobbe et al., 2021; Muennighoff et al., 2024), as well as the 7B and 13B versions of CodeLlama (Roziere et al., 2023), which are trained on both code and natural language corpora and further fine-tuned to follow natural language instructions. As for the Instructor LM I, we choose GPT-3.5-Turbo.

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Hyungjoo Chae, Yonsei University;
(2) Yeonghyeon Kim, Yonsei University;
(3) Seungone Kim, KAIST AI;
(4) Kai Tzu-iunn Ong, Yonsei University;
(5) Beong-woo Kwak, Yonsei University;
(6) Moohyeon Kim, Yonsei University;
(7) Seonghwan Kim, Yonsei University;
(8) Taeyoon Kwon, Yonsei University;
(9) Jiwan Chung, Yonsei University;
(10) Youngjae Yu, Yonsei University;
(11) Jinyoung Yeo, Yonsei University.
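To make the differences between the four zero-shot baselines in Section 3.2 concrete, the Python sketch below shows one plausible way the prompts could be assembled for a single task instance. It is only an illustration under stated assumptions: the exact prompt wording used in the paper is given in Appendix C, and the function names and the `answer` variable convention here are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of the four zero-shot baseline prompt formats.
# NOT the paper's exact prompts (see Appendix C); wording and helper names
# are hypothetical approximations chosen for clarity.

def direct_prompt(question: str) -> str:
    # (1) Direct prompting: ask for the answer only, with no rationale.
    return f"{question}\nAnswer:"

def zero_shot_cot_prompt(question: str) -> str:
    # (2) Zero-shot CoT (Kojima et al., 2022): append the trigger phrase so the
    # model writes its reasoning steps before the answer.
    return f"{question}\nLet's think step by step."

def zero_shot_pot_prompt(question: str) -> str:
    # (3) Zero-shot PoT (Chen et al., 2023): ask for instance-specific Python
    # code; the generated code is later run with a Python interpreter and its
    # result is taken as the final answer.
    return (
        f"{question}\n"
        "Write a Python program that computes the answer and stores it in a "
        "variable named `answer`."
    )

def nl_planning_prompt(task_plan_in_nl: str, question: str) -> str:
    # (4) NL planning: like Think-and-Execute, but the task-level plan given to
    # the Reasoner is written in natural language instead of pseudocode.
    return f"Task plan:\n{task_plan_in_nl}\n\nInstance:\n{question}\nAnswer:"

if __name__ == "__main__":
    q = "Is the following sequence of parentheses balanced? ( [ ] ) {"
    print(zero_shot_cot_prompt(q))
```

In all four settings no demonstrations are included in the prompt, matching the zero-shot evaluation protocol described in Section 3.1.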