Table of Links

Abstract and 1. Introduction
2 Think-and-Execute
3 Experimental Setup
4 Results
5 Analysis
6 Related Work
7 Limitations and Discussion
8 Conclusion and References
A Experimental Details
B Details of Think-and-Execute
C Prompts Used in Our Experiments
D Human-written Pseudocode Prompts
E Generated Analyses
F Generated Pseudocode Prompts
G Qualitative Analysis

A Experimental Details

A.1 Models

We use several LLMs, including GPT-3.5-Turbo (OpenAI, 2023) and GPT-4 (Achiam et al., 2023), which are available via the OpenAI API[4], and an open-source LLM, CodeLlama (Roziere et al., 2023), as the Instructor LM I and the Reasoner LM R.

• GPT-3.5-Turbo: gpt-3.5-turbo-0125
• GPT-4: gpt-4-0613
• CodeLlama: CodeLlama is a family of LLaMA 2 variants fine-tuned for the code domain on code corpora. The collection includes models of various sizes (7B, 13B, 34B, and 70B) and diverse types, including the foundation model, a Python-specialized model, and an instruction-following model. In our study, we employ the CodeLlama-Instruct models (7B[5], 13B[6]).

A.2 Inference

We use vLLM to improve inference throughput.[7] During our experiments, we adopt temperature sampling with T = 0.0 (i.e., greedy decoding) to generate outputs efficiently. For a task comprising 250 instances, GPT-3.5-Turbo completes inference in about 30 seconds; using 2 A100 GPUs, CodeLlama takes approximately 2 and 5 minutes for the 7B and 13B models, respectively.

A.3 Evaluation

To extract answers for evaluation, the LLMs are prompted to produce the final answer after the trigger phrase "Final answer: ". Following Suzgun et al. (2022), we provide all multiple-choice options to the LLMs as input, then measure accuracy using exact match (EM), which compares the generated output with the ground-truth label. To ensure a fair comparison between PoT and the other baselines, we also accept predictions that contain the text of the correct choice (e.g., "blue") without the choice tag (e.g., "(A)").
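As a concrete illustration of the inference setup in A.2, the minimal sketch below serves CodeLlama-Instruct with vLLM using greedy decoding (T = 0.0) across 2 GPUs. The model identifier follows footnote [5]; the prompt list and generation budget (max_tokens) are illustrative placeholders, not the exact settings used in our experiments.

```python
# Minimal sketch: greedy decoding with vLLM for CodeLlama-Instruct.
# The prompt and max_tokens values are placeholders for illustration.
from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-7b-Instruct-hf",  # 13B variant is used analogously
    tensor_parallel_size=2,                      # 2 A100 GPUs
)
sampling_params = SamplingParams(
    temperature=0.0,   # greedy decoding
    max_tokens=512,    # placeholder generation budget
)

prompts = ["<pseudocode prompt + task instance goes here>"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```

Similarly, the A.3 scoring rule can be summarized in a short sketch: the answer is read off after the "Final answer: " trigger and compared to the gold label with exact match, while the bare text of the correct choice is also accepted. This is an illustrative reconstruction with our own helper names, not the exact evaluation script.

```python
# Illustrative reconstruction of the A.3 scoring rule (helper names are ours).
def extract_final_answer(generation: str) -> str:
    # The model is instructed to end its output with "Final answer: <answer>".
    marker = "Final answer:"
    return generation.split(marker)[-1].strip() if marker in generation else generation.strip()

def is_correct(prediction: str, gold_tag: str, gold_text: str) -> bool:
    # Exact match against the ground-truth label, e.g. "(A)".
    if prediction.strip().lower() == gold_tag.strip().lower():
        return True
    # Lenient match: accept the text of the correct choice (e.g. "blue")
    # even when the "(A)"-style tag is missing from the prediction.
    return gold_text.strip().lower() in prediction.lower()

# Example: a prediction without the choice tag still counts as correct.
pred = extract_final_answer("... Final answer: blue")
assert is_correct(pred, gold_tag="(A)", gold_text="blue")
```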
A.4 Datasets

We take 7 algorithmic benchmarks from the Big-Bench Hard (Suzgun et al., 2022) dataset. Each dataset contains 250 examples. We describe the goal and context of each task below.

• Dyck Languages (DL): Complete a partially given Dyck-4 sequence by predicting the sequence of closing brackets that is missing at the end.
• Geometric Shapes (GS): Determine the geometric figure formed by following all the commands in a given SVG path element.
• Navigate (Nav): Evaluate whether a set of directional commands returns a navigator to the starting point.
• Reasoning about Colored Objects (CO): Given a scenario, deduce the color of a specific object placed on a surface, using the provided context for guidance.
• Temporal Sequences (TS): Examine a chronology of a person's daily activities to find when they could fit an additional activity into their schedule.
• Tracking Shuffled Objects (SO): Ascertain the final positions of several objects after they have been moved from their original locations through a sequence of exchanges. We use the version of the task with 5 objects.
• Web of Lies (WL): Assess the truth value of a Boolean function presented as a narrative problem.

This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

[4] https://openai.com/blog/openai-api
[5] https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf
[6] https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf
[7] https://github.com/vllm-project/vllm

Authors:

(1) Hyungjoo Chae, Yonsei University;
(2) Yeonghyeon Kim, Yonsei University;
(3) Seungone Kim, KAIST AI;
(4) Kai Tzu-iunn Ong, Yonsei University;
(5) Beong-woo Kwak, Yonsei University;
(6) Moohyeon Kim, Yonsei University;
(7) Seonghwan Kim, Yonsei University;
(8) Taeyoon Kwon, Yonsei University;
(9) Jiwan Chung, Yonsei University;
(10) Youngjae Yu, Yonsei University;
(11) Jinyoung Yeo, Yonsei University.