
Think-and-Execute: The Experimental Details


Too Long; Didn't Read

We use several LLMs, including GPT-3.5-Turbo and GPT-4, which are available via the OpenAI API[4], and the open-source LLM CodeLlama, as the Instructor LM and the Reasoner LM.
by Transcompiler: Learn How to Translate Code (@transcompiler)

Abstract and 1. Introduction

2 Think-and-Execute

3 Experimental Setup

4 Results

5 Analysis

6 Related Work

7 Limitations and Discussion

8 Conclusion and References


A Experimental Details

B Details of Think-and-Execute

C Prompts Used in Our Experiments

D Human-written Pseudocode Prompts

E Generated Analyses

F Generated Pseudocode Prompts

G Qualitative Analysis

A Experimental Details

A.1 Models

We use several LLMs as the Instructor LM I and the Reasoner LM R: GPT-3.5-Turbo (OpenAI, 2023) and GPT-4 (Achiam et al., 2023), which are available via the OpenAI API[4], and the open-source LLM CodeLlama (Roziere et al., 2023).


• GPT-3.5-Turbo: gpt-3.5-turbo-0125


• GPT-4: gpt-4-0613


CodeLlama: CodeLlama comprises variants of LLaMA-2 fine-tuned on code corpora for code-domain tasks. The collection includes models of various sizes (7B, 13B, 34B, and 70B) and of diverse types, including a foundation model, a Python-specialized model, and an instruction-following model. In our study, we employ the CodeLlama-Instruct models (7B[5] and 13B[6]).
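For reference, the following is a minimal sketch of querying the GPT-3.5-Turbo snapshot listed above through the OpenAI Python SDK (v1-style client). The prompt text is a placeholder and the decoding settings simply mirror the greedy decoding described in Appendix A.2; this is not the authors' exact harness.

```python
from openai import OpenAI  # OpenAI Python SDK (v1.x style)

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative call to the exact GPT-3.5-Turbo snapshot used in the paper.
response = client.chat.completions.create(
    model="gpt-3.5-turbo-0125",
    temperature=0.0,  # greedy decoding, as in Appendix A.2
    messages=[
        {"role": "user", "content": "Solve the task and end with 'Final answer: ...'"},
    ],
)
print(response.choices[0].message.content)
```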

A.2 Inference

We use vLLM to improve inference throughput.[7] During our experiments, we adopt temperature sampling with T = 0.0 (i.e., greedy decoding) to generate outputs efficiently. For a task comprising 250 instances, GPT-3.5-Turbo completes inference in about 30 seconds. Using 2 A100 GPUs, the CodeLlama 7B and 13B models take approximately 2 and 5 minutes, respectively.
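A minimal sketch of this setup with vLLM is shown below, assuming the 7B CodeLlama-Instruct checkpoint from footnote [5]; the prompt string and max_tokens value are illustrative, not the paper's exact configuration.

```python
from vllm import LLM, SamplingParams

# Greedy decoding: temperature 0.0, as described above.
sampling_params = SamplingParams(temperature=0.0, max_tokens=512)

# Load the CodeLlama-Instruct 7B checkpoint across 2 GPUs via tensor parallelism.
llm = LLM(
    model="codellama/CodeLlama-7b-Instruct-hf",
    tensor_parallel_size=2,
)

prompts = ["<task instance rendered with the pseudocode prompt>"]
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(out.outputs[0].text)
```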

A.3 Evaluation

To extract answers for evaluation, the LLMs generate the final answer after the trigger phrase "Final answer: ". Following Suzgun et al. (2022), we provide all multiple-choice options to the LLMs as input and measure accuracy using exact match (EM), which compares the generated output with the ground-truth label. To ensure a fair comparison between PoT and the other baselines, we also accept predictions that include the text of the correct choice, e.g., blue, even without the choice tag, e.g., "(A)".
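This scoring rule can be sketched roughly as follows. The function names and normalization details are illustrative assumptions; the paper only specifies the trigger phrase and the EM-with-choice-text fallback.

```python
import re

def extract_answer(generation: str) -> str:
    # Take whatever follows the trigger phrase "Final answer: ".
    match = re.search(r"Final answer:\s*(.*)", generation)
    return (match.group(1) if match else generation).strip()

def is_correct(generation: str, gold_tag: str, gold_text: str) -> bool:
    """gold_tag: e.g. "(A)"; gold_text: e.g. "blue"."""
    pred = extract_answer(generation)
    if pred == gold_tag:  # exact match on the choice tag
        return True
    # Fallback: accept predictions that contain the correct choice text
    # even when the "(A)"-style tag is omitted.
    return gold_text.lower() in pred.lower()

# Example
print(is_correct("... Final answer: blue", "(A)", "blue"))  # True
```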

A.4 Datasets

We take 7 algorithmic benchmarks from the Big-Bench Hard (Suzgun et al., 2022) dataset. Each dataset contains 250 examples. We describe the goal and context of each dataset below.


• Dyck Languages (DL): Complete a partially given Dyck-4 sequence by predicting the sequence of closing brackets missing at the end (a reference solution is sketched after this list).


• Geometric Shapes (GS): Determine the geometric figure formed by following all the instructions in a specified SVG path element containing several commands.


• Navigate (Nav): Evaluate whether a set of directional commands will return a navigator to the starting point.


• Reasoning about Colored Objects (CO): Given a scenario, deduce the color of a specific object placed on a surface, using the provided context for guidance.


• Temporal Sequences (TS): Examine a chronology of a person’s daily activities to find when they could fit an additional activity into their schedule.


• Tracking Shuffled Objects (SO): Ascertain the final positions of several objects after they have been moved from their original locations through a sequence of pairwise exchanges. We use the version of the task with 5 objects.


• Web of Lies (WL): Evaluate the truth value of a Boolean function presented as a narrative word problem.
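To make the DL task concrete, below is a minimal reference sketch of how the missing closing brackets can be derived. It is illustrative only, not part of the prompts or evaluation harness, and it assumes the four bracket pairs (), [], {}, <> used in Dyck-4.

```python
def complete_dyck(prefix: str) -> str:
    """Return the closing brackets needed to balance a Dyck-4 prefix."""
    pairs = {"(": ")", "[": "]", "{": "}", "<": ">"}
    closers = set(pairs.values())
    stack = []
    for ch in prefix:
        if ch in pairs:          # opening bracket: expect its closer later
            stack.append(pairs[ch])
        elif ch in closers:      # closing bracket: consumes the latest opener
            assert stack and ch == stack[-1], "unbalanced prefix"
            stack.pop()
    return " ".join(reversed(stack))

# Example: the prefix "( [ { <" should be completed with "> } ] )"
print(complete_dyck("( [ { <"))  # -> "> } ] )"
```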


This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.


[4] https://openai.com/blog/openai-api


[5] https://huggingface.co/codellama/CodeLlama-7b-Instruct-hf


[6] https://huggingface.co/codellama/CodeLlama-13b-Instruct-hf


[7] https://github.com/vllm-project/vllm

Authors:

(1) Hyungjoo Chae, Yonsei University;

(2) Yeonghyeon Kim, Yonsei University;

(3) Seungone Kim, KAIST AI;

(4) Kai Tzu-iunn Ong, Yonsei University;

(5) Beong-woo Kwak, Yonsei University;

(6) Moohyeon Kim, Yonsei University;

(7) Seonghwan Kim, Yonsei University;

(8) Taeyoon Kwon, Yonsei University;

(9) Jiwan Chung, Yonsei University;

(10) Youngjae Yu, Yonsei University;

(11) Jinyoung Yeo, Yonsei University.