
New AI Plans, Learns, and Adapts in Real Time—One Task at a Time


Too Long; Didn't Read

Researchers present Data Interpreter, a solution built on three pivotal techniques for problem-solving in data science: dynamic planning with a hierarchical structure, tool utilization and evolution, and automated confidence-based verification with experience-driven reasoning.

Authors:

(1) Sirui Hong, DeepWisdom and these authors contributed equally to this work;

(2) Yizhang Lin, DeepWisdom and these authors contributed equally to this work;

(3) Bang Liu, Université de Montréal & Mila; the authors are listed in alphabetical order;

(4) Bangbang Liu, DeepWisdom and these authors contributed equally to this work;

(5) Binhao Wu, DeepWisdom and these authors contributed equally to this work;

(6) Danyang Li, DeepWisdom and these authors contributed equally to this work;

(7) Jiaqi Chen, Fudan University and these authors contributed equally to this work;

(8) Jiayi Zhang, Renmin University of China and these authors contributed equally to this work;

(9) Jinlin Wang, DeepWisdom and these authors contributed equally to this work;

(10) Li Zhang, Fudan University and these authors contributed equally to this work;

(11) Lingyao Zhang, these authors contributed equally to this work;

(12) Min Yang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences and these authors contributed equally to this work;

(13) Mingchen Zhuge, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;

(14) Taicheng Guo, University of Notre Dame and these authors contributed equally to this work;

(15) Tuo Zhou, The University of Hong Kong and these authors contributed equally to this work;

(16) Wei Tao, Fudan University and these authors contributed equally to this work;

(17) Wenyi Wang, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;

(18) Xiangru Tang, Yale University and these authors contributed equally to this work;

(19) Xiangtao Lu, DeepWisdom and these authors contributed equally to this work;

(20) Xiawu Zheng, Xiamen University and these authors contributed equally to this work;

(21) Xinbing Liang, DeepWisdom, East China Normal University and these authors contributed equally to this work;

(22) Yaying Fei, Beijing University of Technology and these authors contributed equally to this work;

(23) Yuheng Cheng, The Chinese University of Hong Kong, Shenzhen and these authors contributed equally to this work;

(24) Zongze Xu, DeepWisdom, Hohai University and these authors contributed equally to this work;

(25) Chenglin Wu, DeepWisdom and the corresponding author.

Editor's Note: This is Part 3 of 5 of a research study detailing the development of Data Interpreter, a solution for various data science and real-world tasks. Read the rest below.

3 METHODOLOGY

Our Data Interpreter uses dynamic planning within a hierarchical structure for real-time goal adjustment, as shown in Figure 2; more information can be found in Section 3.1. Data Interpreter then completes tasks by breaking them down as outlined in the plan and executing code, incorporating tools as necessary to augment its proficiency; detailed explanations are provided in Section 3.2. Moreover, each task is subjected to a validation process to ensure its reliability, and the process of executing a task is then characterized and analyzed as an experience that can be retrieved for similar tasks in the future; this mechanism is described in Section 3.3.

3.1 DYNAMIC PLANNING WITH HIERARCHICAL STRUCTURE

Intensive data dependence complicates the modeling and orchestration of data science pipelines. In this section, we show that these pipelines are naturally organized as graphs, model them with a hierarchical structure, and introduce dynamic plan management for effective orchestration.


3.1.1 HIERARCHICAL GRAPH FOR DATA SCIENCE PROBLEMS


Data science projects encompass extensive detail and long-range pipelines, complicating the direct planning of all detailed tasks and coding. This complexity necessitates careful planning, execution, and management (Biswas et al., 2022). Drawing inspiration from the application of hierarchical planning to automated machine learning tasks (Mohr et al., 2018; Mubarak & Koeshidayatullah, 2023), we organize data science pipelines via a hierarchical structure, which first decomposes the intricate data science problem into manageable tasks and then breaks each task down into specific actions executed through code (Figure 3(a)).


Figure 2: The overall design of Data Interpreter. The framework consists of three stages: dynamic plan graph and management, wherein a plan is generated for the data science problem and the state of each task is managed during execution; tool utilization and evolution, involving the selection or creation of suitable tools to solve problems and the continual evolution of those tools; and automated confidence-based verification, which examines and votes on logically sound solutions.


Figure 3(b) illustrates a hierarchical data science pipeline composed of data exploration, correlation analysis, outlier detection, feature engineering, model training, model evaluation, and result visualization tasks (the green region in Figure 3(b)). Alongside the task graph, the code generated by LLMs forms an implicit graph representing the corresponding execution actions (the purple region in Figure 3(b)). This hierarchical graph structure facilitates structured problem-solving for our Data Interpreter and adeptly captures both sequential task relationships (such as from model training to model evaluation and visualization) and parallel task relationships (such as outlier detection and correlation analysis). We therefore propose structuring data science workflows as a hierarchical directed acyclic graph (DAG), which aptly represents data science pipelines at both the task and code levels. Our Data Interpreter leverages the advanced planning capabilities of LLMs to decompose a complex data science problem into multiple tasks consistent with the problem goal and to express their execution dependencies through a graph structure. We design metadata for each task node, including the task description, completion status, and code; Appendix B.2 describes the task node in more detail.
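To make this concrete, here is a minimal sketch of how such a task node and a fragment of the Figure 3(b) graph could be represented in Python; the class layout and field names are our own illustrative assumptions, not the paper's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass
class TaskNode:
    """One node in the task-level DAG (fields assumed from Section 3.1.1)."""
    task_id: str
    description: str                                   # natural-language task description
    dependencies: list = field(default_factory=list)   # upstream task_ids
    code: str = ""                                     # LLM-generated code for this task
    status: str = "pending"                            # "pending" | "success" | "failure"
    result: str = ""                                   # captured execution output

# A fragment of the pipeline in Figure 3(b): outlier detection and
# correlation analysis run in parallel, then both feed feature engineering.
graph = {
    "2.1": TaskNode("2.1", "outlier detection", dependencies=["1"]),
    "2.2": TaskNode("2.2", "correlation analysis", dependencies=["1"]),
    "3":   TaskNode("3", "feature engineering", dependencies=["2.1", "2.2"]),
}
```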


Figure 3: Hierarchical structure. (a) A structured task and action graph illustrating the workflow of a high-level machine learning project, depicting the task dependencies and action sequences required to fulfill the project objectives. (b) Directed acyclic graph (DAG) of tasks, using a machine operational status prediction problem as an example. The task graph expresses the disassembled planned tasks. The action graph, also referred to as the execution graph, executes each node based on the planned task graph; the execution code of each node is generated by the LLM.


3.1.2 DYNAMIC PLAN MANAGEMENT


Leveraging the hierarchical graph structure, tasks are executed automatically. Unlike previous methods (Wei et al., 2022; Besta et al., 2023; Yao et al., 2022) that create and execute a plan once for static problems, we observe that in data-dependence-intensive scenarios the intermediate data among tasks changes dynamically during execution, due to tool operations or new information in the workflow, which can lead to runtime errors when the data no longer matches the pre-defined plan. To tackle this, we introduce dynamic plan management, detailed in Figure 4.


To ensure efficient progress and facilitate plan modifications, our Data Interpreter dynamically updates the code, execution result, and status of each node in the task graph after each execution. A task is considered completed when its corresponding code executes successfully. Once completed, the task is marked as "Success" and added to a list of completed tasks, and execution proceeds to the next task according to the plan. Otherwise, the task is marked as "Failure".
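A minimal sketch of this execute-and-mark loop follows; the `run_code` stub is an illustrative sandboxed executor of our own devising (it assumes generated code leaves its output in a `result` variable), not the paper's implementation, and `TaskNode` is the sketch from Section 3.1.1 above.

```python
def run_code(code: str):
    """Hypothetical executor; a real system would sandbox this properly."""
    try:
        namespace = {}
        exec(code, namespace)                   # illustrative only, never do this on untrusted code
        return True, namespace.get("result", "")
    except Exception as err:
        return False, repr(err)

def execute_plan(graph, order):
    """Walk tasks in topological order, marking each Success/Failure.

    `order` is a topologically sorted list of task_ids into `graph`.
    """
    completed = []
    for task_id in order:
        task = graph[task_id]
        ok, output = run_code(task.code)
        task.result = output
        if ok:
            task.status = "success"
            completed.append(task_id)           # proceed to the next planned task
        else:
            task.status = "failure"             # hand off to self-debugging (below)
            break
    return completed
```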


We have designed two strategies, Self-debugging and Human editing, to enhance autonomous completeness and correctness. In the event of task failure, Self-debugging is enabled, using LLMs to debug the code based on runtime errors for up to a predefined number of attempts. If the task remains unresolved, it is flagged as "Failure". Because data science problems place high demands on logical correctness, an additional human-in-the-loop approach, Human editing, is introduced to ensure code precision. When Human editing is activated, our Data Interpreter holds the task until it is manually modified, then reruns it based on the human input.
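The retry logic might be sketched as follows; `llm_fix_code` is a hypothetical hook standing in for the LLM debugging call, and the attempt budget mirrors the predefined limit described above.

```python
def self_debug(task, run_code, llm_fix_code, max_attempts=3):
    """Retry a failed task, asking the LLM to repair the code each time.

    `run_code` and `llm_fix_code` are caller-supplied hooks; this is a
    sketch of the Self-debugging strategy, not the paper's exact code.
    """
    for _ in range(max_attempts):
        ok, output = run_code(task.code)
        if ok:
            task.status, task.result = "success", output
            return True
        # Feed the runtime error back to the LLM and try the revised code.
        task.code = llm_fix_code(task.code, output)
    task.status = "failure"   # unresolved: flag it, or hold for Human editing
    return False
```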


Figure 4: Dynamic plan management of Data Interpreter. (a) Plan refinement via human editing. The left image illustrates a human-edited task on the graph; the refined plan, with updated tasks 3.1’ and 3.2’ along with the newly added task 3.3, is delineated in the right image. (b) Plan refinement for a failed task. After task execution, task 3.3 fails. The refined plan integrates the existing successful tasks, replaces task 3.3 with the updated task 3.3’, and introduces new tasks 4.1, 4.2, 4.3, and 5.


The Data Interpreter regenerates the plan for failed or manually edited tasks based on the current episodic memory and execution context. Specifically, the regenerated task graph is sorted in topological order and compared to the original task graph using a prefix matching algorithm (Waldvogel, 2000) to identify any differences in instructions; this comparison locates the fork. The final output of this process includes all unchanged tasks before the fork and any tasks added or modified after the fork.
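A simple way to realize this comparison, assuming tasks are matched by their instruction text (a longest-common-prefix match standing in for the cited prefix matching algorithm):

```python
def find_fork(old_order, new_order):
    """Locate the fork between the original and regenerated task graphs.

    Both arguments are topologically sorted lists of task nodes (see the
    TaskNode sketch above); this is our illustration, not the paper's code.
    """
    fork = 0
    for old, new in zip(old_order, new_order):
        if old.description != new.description:
            break
        fork += 1
    unchanged = new_order[:fork]   # tasks before the fork, carried over as-is
    refreshed = new_order[fork:]   # tasks updated or newly added after the fork
    return unchanged, refreshed
```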


Throughout execution, our Data Interpreter monitors the dynamic task graph, promptly removing failed tasks, generating refined tasks, and updating the graph. This avoids the inefficiency of generating all fine-grained planning tasks at once and improves the success rate of plans requiring multi-step execution, making the approach better suited to data science scenarios where the data flow constantly changes.


3.2 TOOL UTILIZATION AND GENERATION

Some tasks are too intricate to be coded entirely from scratch, making it essential to utilize existing toolkits or integrate existing code snippets. Take feature engineering, for example, which demands domain-specific expertise for data transformation: using tools crafted by experts can be significantly more effective, since generating this type of code directly with LLMs poses considerable challenges. Similarly, email processing involves orchestrating different code snippets into efficient workflows. To improve the efficiency of tool use, we suggest a two-pronged method: one prong recommends or generates the most suitable tools, while the other organizes these tools effectively. This approach offers clear advantages over previous methods (Schick et al., 2024; Hong et al., 2023), which relied on mere library calls or did not incorporate tools with clear modularization in the code. By combining the strengths and mitigating the weaknesses of these methods, our approach presents a more balanced and efficient solution. Note that tool usage follows the principles and procedures of the task graph described in Section 3.1.1; the use of a tool is itself considered one of the tasks in the graph.



Figure 5: Tool usage pipeline in Data Interpreter. The tool recommendation initially selects tools based on task classification. Then the tool organization combines multiple tools as needed to accomplish tasks.



3.2.1 TOOL RECOMMENDATION AND ORGANIZATION


For tool recommendation, the Data Interpreter first classifies tools based on task descriptions and task types. This effectively narrows the pool of potential tools and makes selection more efficient for subsequent tasks. It then identifies the top-k tools that best fit a task by evaluating the compatibility of each candidate tool with that task. Additionally, we incorporate a tool schema to help LLMs understand each tool's functionality and use cases, embedding this schema during the execution phase as outlined in Appendix B.3.1. This schema-guided understanding enables more accurate tool selection and application. During execution, the algorithm also dynamically adjusts tool parameters using LLMs, considering the code context and task requirements; this dynamic parameter adjustment improves the tools' adaptability to the tasks at hand.
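A hedged sketch of this classify-then-rank step; the registry layout and the `score_fn` hook (an LLM-based compatibility scorer returning a value in [0, 1]) are assumptions on our part.

```python
def recommend_tools(task_type, description, tool_registry, score_fn, k=3):
    """Shortlist tools: filter by task type, then rank the top-k by fit.

    `tool_registry` maps task types to candidate tool records, each assumed
    to carry a schema (name, description, parameters) that is later embedded
    in the prompt so the LLM understands the tool's use cases.
    """
    candidates = tool_registry.get(task_type, [])
    ranked = sorted(
        candidates,
        key=lambda tool: score_fn(description, tool["schema"]),
        reverse=True,
    )
    return ranked[:k]   # top-k tools, schemas included for the LLM
```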


For tool organization, our method employs LLMs to seamlessly integrate tools into the code, positioning them optimally based on a thorough analysis of the tool functions. This is particularly useful for complex tasks such as feature engineering, making the process both efficient and adaptable to tool integration. We refine this process by considering the context of the current task and the tools at our disposal: the LLM is directed to craft code that not only invokes the required tool functions but also seamlessly integrates these calls with the rest of the code, allowing various tools to be dynamically orchestrated to meet the specific requirements of the coding process. A prime example of this adaptability is the CatCount tool in our deployment pipeline (Figure 5), whose fit and transform functions are used dynamically according to the task context. This strategy ensures that tool integration is automated and precisely aligned with task demands, significantly boosting coding efficiency and flexibility.
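For illustration, a frequency-encoding tool with fit and transform methods in the spirit of CatCount might look like the following; the actual CatCount interface in Data Interpreter may differ, and the column names are hypothetical.

```python
import pandas as pd

class CatCount:
    """Frequency-encode a categorical column (an illustrative re-creation
    of the CatCount tool named in Figure 5, not its real implementation)."""

    def __init__(self, col: str):
        self.col = col
        self.counts = None

    def fit(self, df: pd.DataFrame):
        # Learn category frequencies on the training split only.
        self.counts = df[self.col].value_counts().to_dict()
        return self

    def transform(self, df: pd.DataFrame) -> pd.DataFrame:
        # Map each category to its training-set count (0 for unseen values).
        out = df.copy()
        out[f"{self.col}_cnt"] = out[self.col].map(self.counts).fillna(0)
        return out

# How LLM-generated code might weave the tool into a pipeline:
# encoder = CatCount("machine_type").fit(train_df)
# train_df = encoder.transform(train_df)
# test_df = encoder.transform(test_df)   # reuses counts learned on train
```

Splitting fit from transform is what lets the LLM place the two calls at different points in the pipeline, which is exactly the dynamic orchestration the paragraph above describes.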


3.2.2 CONTINUOUS TOOL EVOLUTION


To minimize the frequency of debugging and improve execution efficiency, our model learns from experience during task execution. After each task, it abstracts tools by distilling their core functionalities, stripping away any sample-specific logic. This creates versatile, generic tool functions that are added to the library for future use.
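As an illustration of this distillation (the snippet and resulting function are our own example, not drawn from the paper): a one-off line hard-coding a column name becomes a parameterized, reusable tool.

```python
# Sample-specific snippet produced while solving one task:
#   df["age"] = df["age"].fillna(df["age"].median())

def fill_missing_with_median(df, columns):
    """Generic tool distilled from the snippet above: the hard-coded
    column name is lifted into a parameter so the function is reusable."""
    for col in columns:
        df[col] = df[col].fillna(df[col].median())
    return df
```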


In addition, the Data Interpreter automatically ensures the reliability of these tools by conducting rigorous unit tests and leveraging its self-debugging capabilities through LLMs. Consequently, Data Interpreter facilitates rapidly transforming sample-specific code snippets into reusable tool functions, continuously improving its toolkit and coding expertise over time.
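A unit test of the kind that could be auto-generated to vet the tool abstracted above before it enters the library (again, an illustrative assumption rather than the paper's actual test harness):

```python
import pandas as pd

def test_fill_missing_with_median():
    """Check that the abstracted tool fills gaps with the column median."""
    df = pd.DataFrame({"age": [1.0, None, 3.0]})
    out = fill_missing_with_median(df, ["age"])
    assert out["age"].isna().sum() == 0   # no missing values remain
    assert out["age"].iloc[1] == 2.0      # median of [1, 3] is 2

test_fill_missing_with_median()
```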

3.3 ENHANCING REASONING WITH VERIFICATION AND EXPERIENCE

Our task graph, dynamic plan management, and tool utilization improve task planning and tool mastery. However, relying only on error detection or exception capture provides inadequate feedback for completing a task: in complex reasoning problems, even code that runs without raising errors can contain logical flaws (Wang et al., 2023; Zhou et al., 2023a). Therefore, in this section, we introduce automated confidence-based verification and leverage experience to further improve the correctness and efficiency of the reasoning results.


3.3.1 AUTOMATED CONFIDENCE-BASED VERIFICATION


To address this issue, we propose a simple yet effective technique, Automated Confidence-based Verification (ACV), which introduces an interpretation layer between the environment and the Data Interpreter. This layer allows LLMs to evaluate code execution results and determine whether a code solution is mathematically rigorous or logically correct. Specifically, once a code solution for a task begins executing, the Data Interpreter is required to generate validation code that checks whether the output complies with the task requirements. The validation code simulates the logical process described in the task and verifies the correctness of the result produced by the code. This is akin to performing white-box testing on each task, guaranteeing that the code produces the expected output.
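Putting the pieces together, the ACV loop might be sketched like this; `run_code` and `write_validator` are caller-supplied hooks (the executor and the LLM call that drafts validation code), and the score-collection convention is our assumption.

```python
def acv_select(task_description, solution_code, run_code, write_validator,
               n_verifications=5):
    """Automated Confidence-based Verification, sketched under assumptions.

    Each round re-runs the solution, has the LLM write independent
    validation code from the task description, and records the confidence
    that validation assigns to the candidate answer.
    """
    scores = {}                               # candidate answer -> list of scores
    for _ in range(n_verifications):
        ok, answer = run_code(solution_code)
        if not ok:
            continue                          # failed runs yield no candidate
        validator = write_validator(task_description, solution_code, answer)
        _, confidence = run_code(validator)   # validator emits a score in [0, 1]
        scores.setdefault(answer, []).append(float(confidence))
    # Choose the candidate with the highest average confidence (Section 3.3.1).
    best = max(scores, key=lambda a: sum(scores[a]) / len(scores[a]))
    return best, scores
```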


The confidence score helps the Data Interpreter choose a more accurate result as the final answer by ranking the candidate execution results according to their average confidence scores.
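In symbols (our notation, not the paper's): if $S_a$ denotes the set of confidence scores assigned to candidate answer $a$ across verification rounds, the selected answer maximizes the average score.

```latex
\bar{c}(a) = \frac{1}{|S_a|} \sum_{s \in S_a} s,
\qquad
a^{\ast} = \arg\max_{a} \bar{c}(a)
```

In the Figure 6 example below, $S_{1/108} = \{0.2, 1\}$ averages to 0.6 while $S_{56/219} = \{0.2, 0.5, 0.2\}$ averages to 0.3, so 1/108 is selected.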


A specific example of this automated confidence-based verification process, drawn from the MATH dataset, is shown in Figure 6. The validation code takes into account the task, the code, and its execution result: the function is_prime comes from the code under check, the probability is derived from the task description, and given_answer is the candidate answer. In this example, the process undergoes five separate verifications. The code execution results (i.e., the candidate answers) for the first and fifth verifications are 1/108, while the remaining verifications consistently yield 56/219. The candidate answer 1/108 receives two confidence scores (0.2 and 1), averaging 0.6; the candidate answer 56/219 receives three confidence scores (0.2, 0.5, and 0.2), averaging 0.3. Since the former has the higher average confidence score, our Data Interpreter selects 1/108 as the final answer, which is correct. In contrast, a simple majority-voting strategy would choose the latter, which is wrong.




Figure 6: Example for Automated Confidence-based Verification. Inside the dotted box is the verification process, and below the dotted box is the final answer based on verification.




3.3.2 EXPERIENCE-DRIVEN REASONING


Because automated confidence-based verification makes the task-solving process more transparent and reliable, the data generated during verification can be reused as experience for other tasks. We therefore improve the Data Interpreter's adaptability through a reflective analysis that allows tasks to be reviewed, updated, and confirmed; we call this process Experience-Driven Reasoning.


Specifically, the Data Interpreter integrates an external repository, designated the 'experience pool', to archive essential elements of each task, including the task description, the final version of the code, and the final answer. Within the pool, all archived data is reorganized into reusable experiences via a reflective mechanism (Zhao et al., 2023; Shin et al., 2023). These experiences, covering both failed and successful attempts, provide comprehensive context for a task.


This pool functions as a valuable resource, enabling past experiences to be retrieved to inform and optimize new task executions. An experience can be reused if it is among the nearest neighbors of a new task in the vector store, which is built through task-level reflective analysis. Specifically, for a given task, the top-k experiences are retrieved as context, improving the accuracy and efficiency of its reasoning. This mirrors a fundamental principle of human cognition, whereby individuals draw on past experiences to enhance decision-making and problem-solving. Experimental evaluations validating this approach can be found in Section 4.3.
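A minimal sketch of this retrieval step, assuming each archived experience carries a precomputed embedding and the caller supplies an `embed` function; the pool layout and cosine-similarity metric are our assumptions.

```python
import numpy as np

def retrieve_experiences(task_description, pool, embed, k=3):
    """Top-k nearest-neighbor lookup over the experience pool.

    Each entry in `pool` is assumed to hold a task description, final code,
    final answer, and an "embedding" vector (Section 3.3.2).
    """
    query = np.asarray(embed(task_description), dtype=float)
    query /= np.linalg.norm(query)

    def similarity(entry):
        vec = np.asarray(entry["embedding"], dtype=float)
        return float(query @ (vec / np.linalg.norm(vec)))   # cosine similarity

    # The retrieved experiences are injected as context for the new task.
    return sorted(pool, key=similarity, reverse=True)[:k]
```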


This paper is available on arXiv under a CC BY 4.0 DEED license.