Our Analysis on Think-and-Execute and Pseudocode

by @transcompiler


Too Long; Didn't Read

We conduct experiments to address the following research questions: Is task-level pseudocode more helpful than instance-specific pseudocode? Does pre-training on code corpora improve reasoning? How is the quality of the logic discovered by THINK-AND-EXECUTE compared to human-written logic?

Abstract and 1. Introduction

2 Think-and-Execute

3 Experimental Setup

4 Results

5 Analysis

6 Related Work

7 Limitations and Discussion

8 Conclusion and References


A Experimental Details

B Details of Think-and-Execute

C Prompts Used in Our Experiments

D Human-written Pseudocode Prompts

E Generated Analyses

F Generated Pseudocode Prompts

G Qualitative Analysis

5 Analysis

We conduct experiments to address the following research questions:


RQ1: Is task-level pseudocode more helpful than instance-specific pseudocode?


RQ2: Does pre-training on code corpora improve reasoning?


RQ3: How is the quality of the logic discovered by THINK-AND-EXECUTE compared to human-written logic?

5.1 Implementing the Underlying Logic is more Effective than Instance-specific Logic in Pseudocode (RQ1)

We conduct an analysis to determine whether the improvement of THINK-AND-EXECUTE stems from our chosen format for the task-level instruction, i.e., pseudocode. We compare THINK-AND-EXECUTE with a concurrent work, Chain-of-Code (CoC) (Li et al., 2023). In Table 3, THINK-AND-EXECUTE outperforms CoC, showing about a 2x improvement in the average score. The main difference between THINK-AND-EXECUTE and CoC is that we use pseudocode generated to express the logic shared among a task's instances, while CoC incorporates pseudocode as part of the intermediate reasoning steps toward the solution of a given instance. Hence, the results indicate the advantage of applying pseudocode to generate task-level instructions over using it solely as part of the rationale; the sketch below illustrates this distinction.
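As an illustration of the distinction, here is a minimal sketch of what a task-level pseudocode prompt might look like for a simplified Navigate-style task. The function name, inputs, and task simplification are our own assumptions, not the paper's actual prompts. The key property is that the function encodes the logic shared by all instances of the task and is reused verbatim across them, whereas a Chain-of-Code-style approach would generate fresh reasoning code for each individual question.

```python
# Hypothetical task-level pseudocode prompt (our illustration, not the paper's):
# written once per *task* and reused for every instance; only the input changes.

def solve_navigate(instructions: list[str]) -> bool:
    """Task-level logic: simulate the moves and check whether we return to the origin."""
    x, y = 0, 0
    dx, dy = 0, 1  # start facing north
    for step in instructions:
        if step == "turn left":
            dx, dy = -dy, dx
        elif step == "turn right":
            dx, dy = dy, -dx
        elif step == "turn around":
            dx, dy = -dx, -dy
        else:  # e.g. "take 3 steps"
            n = int(step.split()[1])
            x, y = x + dx * n, y + dy * n
        print(f"after '{step}': position=({x}, {y}), facing=({dx}, {dy})")
    return (x, y) == (0, 0)

# Two different instances answered by the same task-level prompt.
print(solve_navigate(["take 2 steps", "turn left", "take 2 steps"]))    # False
print(solve_navigate(["take 4 steps", "turn around", "take 4 steps"]))  # True
```

An instance-specific approach would instead emit a short, throwaway snippet tailored to each question's particular sequence of moves, so the underlying task logic is rediscovered (or missed) anew every time.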


Figure 4: Analysis of the effect of code pre-training on reasoning capability when applying THINK-AND-EXECUTE. Without pre-training on code corpora, the accuracies drop notably.


Table 4: Comparison between THINK-AND-EXECUTE and human-written pseudocode prompts.

5.2 THINK-AND-EXECUTE Requires Knowledge in Code (RQ2)

To understand whether SLMs acquire the ability to understand task-level logic written in pseudocode during pre-training on code corpora, we compare the performance of CodeLlama-13B with Llama-13B using THINK-AND-EXECUTE. In Figure 4, CodeLlama-13B shows better reasoning capabilities than Llama-13B on all tasks. These results suggest that the improvement from using THINK-AND-EXECUTE may depend on knowledge of code, which is usually obtained by pre-training on code corpora. Writing code typically involves understanding the logic behind a given problem and anticipating the execution results of the code, which resembles the reasoning process of THINK-AND-EXECUTE, as sketched below.
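To make this analogy concrete, the sketch below (our own illustration of a simplified Tracking Shuffled Objects-style instance, not a prompt from the paper) shows the kind of intermediate-state trace that executing pseudocode produces. Predicting each intermediate state, as a programmer does when reading code, is essentially what the Reasoner is asked to do in the EXECUTE phase, and it is a skill naturally exercised by pre-training on code.

```python
# Our illustration (not the paper's prompt): tracing a simplified
# Tracking Shuffled Objects-style instance. The printed states are exactly
# what the Reasoner must anticipate step by step without running the code.

def track_swaps(initial: dict[str, str], swaps: list[tuple[str, str]]) -> dict[str, str]:
    """Apply each pairwise swap and print the state after every step."""
    owners = dict(initial)
    for a, b in swaps:
        owners[a], owners[b] = owners[b], owners[a]
        print(f"after {a} <-> {b}: {owners}")
    return owners

final = track_swaps(
    {"Alice": "red ball", "Bob": "green ball", "Claire": "blue ball"},
    [("Alice", "Bob"), ("Bob", "Claire")],
)
print("Alice ends with the", final["Alice"])  # green ball
```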

5.3 THINK-AND-EXECUTE can Generate a Logic Comparable to Human’s (RQ3)

To gauge LLMs' capabilities in discerning the underlying logic of a task, we compare THINK-AND-EXECUTE (using GPT-3.5-Turbo as the Instructor) with human-written pseudocode prompts. The results are shown in Table 4. Using GPT-3.5-Turbo as the Reasoner, THINK-AND-EXECUTE achieves 60.4% accuracy, which is superior to the human-written pseudocode prompts (55.7% accuracy). In particular, on the Navigate and Tracking Shuffled Objects tasks, pseudocode prompts generated by THINK-AND-EXECUTE elicit better performance. This also holds true when adopting CodeLlama-7B and -13B as the Reasoner, further suggesting the effectiveness of our THINK step over human writers.

5.4 Impact of LLMs’ Capability on THINK-AND-EXECUTE

In examining the impact of LLMs' capabilities within our framework, we investigate the influence of both the Reasoner and the Instructor on performance, as shown in Table 5. Notably, higher accuracy scores are observed when using GPT-3.5-Turbo as the Reasoner compared to CodeLlama-13B and CodeLlama-34B. Additionally, the capability of the Instructor plays a crucial role, with GPT-3.5-Turbo as the Instructor exhibiting the highest accuracy scores across all configurations. These results underscore the significance of both the Reasoner and the Instructor in enhancing the performance of THINK-AND-EXECUTE.


Table 5: Analysis of the effect of the capability of the Reasoner and the Instructor on performance. We report the average performance on the 7 tasks.


This paper is available on arxiv under CC BY-NC-ND 4.0 DEED license.

Authors:

(1) Hyungjoo Chae, Yonsei University;

(2) Yeonghyeon Kim, Yonsei University;

(3) Seungone Kim, KAIST AI;

(4) Kai Tzu-iunn Ong, Yonsei University;

(5) Beong-woo Kwak, Yonsei University;

(6) Moohyeon Kim, Yonsei University;

(7) Seonghwan Kim, Yonsei University;

(8) Taeyoon Kwon, Yonsei University;

(9) Jiwan Chung, Yonsei University;

(10) Youngjae Yu, Yonsei University;

(11) Jinyoung Yeo, Yonsei University.