B Details of Think-and-Execute
C Prompts Used in Our Experiments
D Human-written Pseudocode Prompts
F Generated Pseudocode Prompts
We use several LLMs as the Instructor LM I and the Reasoner LM R, including GPT-3.5-Turbo (OpenAI, 2023) and GPT-4 (Achiam et al., 2023), which are available via the OpenAI API[4], and the open-source LLM CodeLlama (Roziere et al., 2023).
• GPT-3.5-Turbo: gpt-3.5-turbo-0125
• GPT-4: gpt-4-0613
• CodeLlama: CodeLlama encompasses variations of LLaMA2 fine-tuned for the code domain on code corpora. The collection features models of various sizes (7B, 13B, 34B, and 70B) and diverse types, including the foundation model, a Python-specialized model, and an instruction-following model. In our study, we employ the CodeLlama-Instruct models (7B[5], 13B[6]).
We use vLLM to improve inference throughput.[7] During our experiments, we adopt temperature sampling with T = 0.0 (i.e., greedy decoding) to generate outputs deterministically and efficiently. For a task comprising 250 instances, GPT-3.5-Turbo completes inference in about 30 seconds. Additionally, using 2 A100 GPUs, CodeLlama completes inference in approximately 2 and 5 minutes for the 7B and 13B models, respectively.
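For reference, a minimal sketch of greedy decoding with vLLM for a CodeLlama-Instruct model is shown below; the prompt text, generation length, and parallelism setting are illustrative assumptions rather than the exact configuration used in our experiments.

```python
# Sketch: batched greedy decoding with vLLM for CodeLlama-Instruct.
# The prompt, max_tokens, and tensor_parallel_size are placeholder values.
from vllm import LLM, SamplingParams

llm = LLM(
    model="codellama/CodeLlama-7b-Instruct-hf",
    tensor_parallel_size=2,  # e.g., 2 A100 GPUs
)

# T = 0.0 corresponds to greedy decoding.
params = SamplingParams(temperature=0.0, max_tokens=512)

prompts = ["[INST] <task instance prompt goes here> [/INST]"]
outputs = llm.generate(prompts, params)

for out in outputs:
    print(out.outputs[0].text)
```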
To extract answers for evaluation, LLMs generate the final answer after the trigger phrase "Final answer: ". Following Suzgun et al. (2022), we provide all multiple-choice options to LLMs as input, then measure accuracy using exact match (EM), which compares the generated output with the ground-truth label. To ensure a fair comparison between PoT and other baselines, we also accept a prediction that includes the text of the correct choice, e.g., blue, but omits the choice tag, e.g., "(A)".
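A minimal sketch of this extraction and relaxed scoring logic is given below; the function names and argument layout are illustrative assumptions, not the exact evaluation script.

```python
# Sketch: extract the answer after the trigger phrase and score with relaxed exact match.
def extract_answer(generation: str) -> str:
    """Return the text following the trigger phrase "Final answer: "."""
    marker = "Final answer: "
    if marker in generation:
        return generation.split(marker)[-1].strip()
    return generation.strip()

def is_correct(generation: str, gold_tag: str, gold_text: str) -> bool:
    """Accept an exact match on the choice tag, e.g., "(A)",
    or a prediction containing the choice text, e.g., "blue"."""
    pred = extract_answer(generation)
    return pred == gold_tag or gold_text.lower() in pred.lower()

# Example usage:
# is_correct("... Final answer: (A)",  "(A)", "blue")  -> True
# is_correct("... Final answer: blue", "(A)", "blue")  -> True
```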
We take 7 algorithmic benchmarks from the Big-Bench Hard (Suzgun et al., 2022) dataset, each containing 250 examples. We describe the goal and context of each task below.
• Dyck Languages (DL): Complete a partially given Dyck-4 sequence by predicting the necessary sequence of closing brackets that are missing at the end (a reference completion is sketched after this list).
• Geometric Shapes (GS): Determine the geometric figure formed by following all the instructions in a specified SVG path element containing several commands.
• Navigate (Nav): Evaluate whether a set of directional commands will return a navigator to the starting point.
• Reasoning about Colored Objects (CO): Given a scenario, deduce the color of a specific object placed on a surface, using the provided context for guidance.
• Temporal Sequences (TS): Examine a chronology of a person’s daily activities to find when they could fit an additional activity into their schedule.
• Tracking Shuffled Objects (SO): Ascertain the final positions of several objects after they have been moved from their original positions through a sequence of pairwise swaps. We use the version of the task with 5 objects.
• Web of Lies (WL): Evaluate the truth value of a Boolean function expressed as a narrative word problem.
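To make the DL task concrete, a reference completion for a Dyck-4 prefix can be computed with a simple stack; the following sketch is illustrative only and is not part of our method or prompts.

```python
# Sketch: reference solver for the Dyck Languages task — given an unbalanced
# prefix of brackets, return the closing brackets needed to balance it.
PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def complete_dyck(prefix: str) -> str:
    stack = []
    for ch in prefix:
        if ch in PAIRS:              # opening bracket: remember it
            stack.append(ch)
        elif ch in PAIRS.values():   # closing bracket: matches the last opener
            stack.pop()
    # close the remaining openers in reverse order (last-opened first)
    return " ".join(PAIRS[ch] for ch in reversed(stack))

# Example: complete_dyck("[<()>") returns "]"
```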
This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.
[4] https://openai.com/blog/openai-api
[5] https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf
[6] https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf
[7] https://github.com/vllm-project/vllm
Authors:
(1) Hyungjoo Chae, Yonsei University;
(2) Yeonghyeon Kim, Yonsei University;
(3) Seungone Kim, KAIST AI;
(4) Kai Tzu-iunn Ong, Yonsei University;
(5) Beong-woo Kwak, Yonsei University;
(6) Moohyeon Kim, Yonsei University;
(7) Seonghwan Kim, Yonsei University;
(8) Taeyoon Kwon, Yonsei University;
(9) Jiwan Chung, Yonsei University;
(10) Youngjae Yu, Yonsei University;
(11) Jinyoung Yeo, Yonsei University.