We conduct experiments to address the following research questions:
• RQ1: Is task-level pseudocode more helpful than instance-specific pseudocode?
• RQ2: Does pre-training on code corpora improve reasoning?
• RQ3: How is the quality of the logic discovered by THINK-AND-EXECUTE compared to human-written logic?
We conduct an analysis to check whether the improvement of THINK-AND-EXECUTE comes from our chosen format for the task-level instruction, i.e., pseudocode. We compare THINK-AND-EXECUTE with a concurrent work, Chain-of-Code (CoC) (Li et al., 2023). In Table 3, THINK-AND-EXECUTE outperforms CoC, showing about a 2x improvement in the average score. The main difference between THINK-AND-EXECUTE and CoC is that we use pseudocode generated to express the logic shared across all instances of a task, whereas CoC incorporates pseudocode only as part of the intermediate reasoning steps towards the solution of a given instance. Hence, the results indicate the advantage of using pseudocode to generate a task-level instruction over using it solely as part of the rationale.
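To make this distinction concrete, the sketch below shows what a task-level pseudocode prompt might look like for a navigation-style task: a single function is shared by every instance, and the Reasoner simulates its execution on each input. This is an illustrative, simplified example of the format; the function name, the logic, and the instruction parsing are our own assumptions, not the exact prompt produced by the framework.

```python
# Illustrative task-level pseudocode prompt (simplified, not the paper's exact prompt).
# The same function is reused for every instance of the task; the Reasoner LLM
# simulates its execution step by step for a given list of instructions.
def solve_navigate(instructions: list[str]) -> bool:
    """Return True if following all instructions brings you back to the start."""
    x, y = 0, 0        # current position
    dx, dy = 0, 1      # facing direction (initially "forward", along +y)

    for step in instructions:
        if step.startswith("Turn left"):
            dx, dy = -dy, dx       # rotate 90 degrees counter-clockwise
        elif step.startswith("Turn right"):
            dx, dy = dy, -dx       # rotate 90 degrees clockwise
        elif step.startswith("Turn around"):
            dx, dy = -dx, -dy      # rotate 180 degrees
        else:                      # e.g., "Take 3 steps."
            n = int(step.split()[1])
            x, y = x + n * dx, y + n * dy

    return (x, y) == (0, 0)        # back at the starting point?
```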
To understand whether SLMs acquire the ability to interpret task-level logic written in pseudocode during pre-training on code corpora, we compare CodeLlama-13B with Llama-13B using THINK-AND-EXECUTE. In Figure 4, CodeLlama-13B shows better reasoning capabilities than Llama-13B on all tasks. These results suggest that the improvement from THINK-AND-EXECUTE depends in part on knowledge of code, which is usually obtained by pre-training on code corpora. Writing code involves understanding the logic behind a given problem and anticipating the results of executing the code, which resembles the reasoning process of THINK-AND-EXECUTE.
To gauge LLMs’ capabilities in discerning the underlying logic of a task, we compare THINK-AND-EXECUTE (using GPT-3.5-Turbo as the Instructor) with human-written pseudocode prompts. The results are shown in Table 4. Using GPT-3.5-Turbo as the Reasoner, THINK-AND-EXECUTE achieves 60.4% accuracy, outperforming the human-written pseudocode prompts (55.7%). In particular, on the Navigate and Tracking Shuffled Objects tasks, the pseudocode prompts generated by THINK-AND-EXECUTE elicit better performance. This also holds when adopting CodeLlama-7B and -13B as the Reasoner, further suggesting the effectiveness of our THINK step over human writers.
To examine the impact of LLMs’ capabilities within our framework, we investigate how both the Reasoner and the Instructor affect performance, as shown in Table 5. Notably, higher accuracy is achieved when using GPT-3.5-Turbo as the Reasoner compared to CodeLlama-13B and CodeLlama-34B. The choice of Instructor also plays a crucial role, with GPT-3.5-Turbo yielding the highest accuracy across all configurations. These results underscore the significance of both the Reasoner and the Instructor in the performance of THINK-AND-EXECUTE.
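For readers unfamiliar with the two roles, the following is a minimal sketch of how an Instructor and a Reasoner might be composed in such a pipeline. The `call_llm` helper, the function names, and the prompt wording are hypothetical placeholders standing in for an actual chat-completion client and for the paper's real prompts.

```python
# Minimal sketch of a two-role pipeline (Instructor -> Reasoner), assuming a
# hypothetical call_llm(model, prompt) helper for any chat-completion backend.
def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API client here")

def build_task_prompt(instructor_model: str, task_description: str, examples: list[str]) -> str:
    # THINK phase: the Instructor analyzes a few instances of the task and
    # writes one task-level pseudocode prompt shared by all instances.
    prompt = (
        "Analyze the following task and write task-level pseudocode that solves "
        f"any instance of it.\n\nTask: {task_description}\n\nExamples:\n" + "\n".join(examples)
    )
    return call_llm(instructor_model, prompt)

def run_instance(reasoner_model: str, pseudocode: str, instance: str) -> str:
    # EXECUTE phase: the Reasoner simulates the pseudocode on one instance,
    # tracking intermediate variables before giving the final answer.
    prompt = (
        f"{pseudocode}\n\nSimulate the execution of the pseudocode above on the "
        f"following input, printing intermediate steps, then give the final answer.\n\nInput: {instance}"
    )
    return call_llm(reasoner_model, prompt)
```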
This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.
Authors:
(1) Hyungjoo Chae, Yonsei University;
(2) Yeonghyeon Kim, Yonsei University;
(3) Seungone Kim, KAIST AI;
(4) Kai Tzu-iunn Ong, Yonsei University;
(5) Beong-woo Kwak, Yonsei University;
(6) Moohyeon Kim, Yonsei University;
(7) Seonghwan Kim, Yonsei University;
(8) Taeyoon Kwon, Yonsei University;
(9) Jiwan Chung, Yonsei University;
(10) Youngjae Yu, Yonsei University;
(11) Jinyoung Yeo, Yonsei University.