This New AI Tool Claims to Solve Data Problems Better Than Anything Else—Here’s Why That Matters

Too Long; Didn't Read

Researchers have developed a solution that emphasizes three pivotal techniques to augment problem-solving in data science.

Authors:

(1) Sirui Hong, DeepWisdom, and these authors contributed equally to this work;

(2) Yizhang Lin, DeepWisdom, and these authors contributed equally to this work;

(3) Bang Liu, Université de Montréal & Mila, and these authors are listed in alphabetical order;

(4) Bangbang Liu, DeepWisdom and these authors contributed equally to this work;

(5) Binhao Wu, DeepWisdom and these authors contributed equally to this work;

(6) Danyang Li, DeepWisdom and these authors contributed equally to this work;

(7) Jiaqi Chen, Fudan University and these authors contributed equally to this work;

(8) Jiayi Zhang, Renmin University of China and these authors contributed equally to this work;

(9) Jinlin Wang, DeepWisdom and these authors contributed equally to this work;

(10) Li Zhang, Fudan University and these authors contributed equally to this work;

(11) Lingyao Zhang, these authors contributed equally to this work;

(12) Min Yang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences and these authors contributed equally to this work;

(13) Mingchen Zhuge, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;

(14) Taicheng Guo, University of Notre Dame and these authors contributed equally to this work;

(15) Tuo Zhou, The University of Hong Kong and these authors contributed equally to this work;

(16) Wei Tao, Fudan University and these authors contributed equally to this work;

(17) Wenyi Wang, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;

(18) Xiangru Tang, Yale University and these authors contributed equally to this work;

(19) Xiangtao Lu, DeepWisdom and these authors contributed equally to this work;

(20) Xiawu Zheng, Xiamen University and these authors contributed equally to this work;

(21) Xinbing Liang, DeepWisdom, East China Normal University and these authors contributed equally to this work;

(22) Yaying Fei, Beijing University of Technology and these authors contributed equally to this work;

(23) Yuheng Cheng, The Chinese University of Hong Kong, Shenzhen and these authors contributed equally to this work;

(24) Zongze Xu, DeepWisdom, Hohai University and these authors contributed equally to this work;

(25) Chenglin Wu, DeepWisdom, a corresponding author.

Editor's Note: This is Part 1 of 5 of a research study detailing the development of Data Interpreter, a solution for various data science and real-world tasks. Read the rest below.

ABSTRACT

Large Language Model (LLM)-based agents have demonstrated remarkable effectiveness. However, their performance can be compromised in data science scenarios that require real-time data adjustment, optimization expertise stemming from complex dependencies among tasks, and the ability to identify logical errors for precise reasoning. In this study, we introduce the Data Interpreter, a solution that solves problems with code and emphasizes three pivotal techniques to augment problem-solving in data science: 1) dynamic planning with hierarchical graph structures for real-time data adaptability; 2) dynamic tool integration to enhance code proficiency during execution, enriching the requisite expertise; 3) identification of logical inconsistencies in feedback, with efficiency improved through experience recording. We evaluate the Data Interpreter on various data science and real-world tasks. Compared to open-source baselines, it demonstrates superior performance, with significant improvements in machine learning tasks (from 0.86 to 0.95). It also shows a 26% increase on the MATH dataset and a remarkable 112% improvement on open-ended tasks. The solution will be released at https://github.com/geekan/MetaGPT.

1 INTRODUCTION

Large Language Models (LLMs) have enabled agents to excel in a wide range of applications, demonstrating their adaptability and effectiveness (Guo et al., 2024; Wu et al., 2023a; Zhou et al., 2023b). These LLM-powered agents have significantly influenced areas like software engineering (Hong et al., 2023), navigating complex open-world scenarios (Wang et al., 2023; Chen et al., 2024a), facilitating collaborative multi-agent structures for multimodal tasks (Zhuge et al., 2023), improving the responsiveness of virtual assistants (Lu et al., 2023), optimizing group intelligence (Zhuge et al., 2024), and contributing to scientific research (Tang et al., 2024).


Recent studies have focused on improving the problem-solving capabilities of these agents by refining their reasoning processes, aiming for increased sophistication and efficiency (Zhang et al., 2023; Besta et al., 2023; Sel et al., 2023; Yao et al., 2024; Wei et al., 2022). However, data-centric scientific problems, including machine learning, data analysis, and mathematical problem-solving, present unique challenges that remain to be addressed. The machine learning process involves complex, lengthy task-handling steps characterized by intricate dependencies among multiple tasks. This requires expert intervention for process optimization and dynamic adjustment in the event of failure or data updates, and it is often challenging for LLMs to provide the correct solution in a single attempt. Furthermore, these problems demand precise reasoning and thorough data verification (Romera-Paredes et al., 2023), which poses additional challenges for LLM-based agent frameworks.


Figure 1: Comparison with various open-source frameworks on machine learning tasks and real-world open-ended tasks.


Moreover, existing works such as (Qiao et al., 2023; OpenAI, 2023; Lucas, 2023) address data-centric problems through code-based problem-solving methods, known as the interpreter paradigm, which combines static requirement decomposition with code execution. However, several key challenges arise when employing these frameworks in practical data science tasks:

1) Data dependence intensity: The complexity inherent in data science arises from the intricate interplay among various steps, which are subject to real-time changes (Liu et al., 2021). For accurate results, data cleaning and comprehensive feature engineering are prerequisites before developing any machine learning model, so it is critical to monitor data changes and dynamically adjust to the transformed data and variables. The machine learning modeling process, encompassing feature selection, model training, and evaluation, involves a broad spectrum of processing operators and search spaces (Zheng et al., 2021). The challenge lies in generating and resolving the entire process code simultaneously.

2) Refined domain knowledge: The specialized knowledge and coding practices of data scientists are pivotal in addressing data-related challenges. Typically embedded in proprietary code and data, this knowledge often remains inaccessible to current LLMs. For instance, generating code for data transformation in specific domains such as energy or geology may be difficult for LLMs without the requisite domain expertise. Existing methodologies depend predominantly on LLMs, a reliance that may streamline the process but can compromise performance.

3) Rigorous logic requirements: Current interpreters such as (Qiao et al., 2023; OpenAI, 2023; Lucas, 2023) incorporate code execution and error-capturing capabilities to enhance problem-solving performance. However, they often treat error-free execution as correct, overlooking code that runs cleanly yet produces logically wrong results. While basic programming tasks can rely on immediate execution feedback once requirements are delineated, data science problems often pose ambiguous, irregular, and ill-defined requirements that are difficult for LLMs to understand. Consequently, LLM-generated code may contain flaws that require rigorous validation of logical soundness, extending beyond mere execution feedback.
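To make the interpreter paradigm concrete, here is a minimal, self-contained sketch of the static loop described above: the requirement is decomposed once up front, each step's code is executed, and the only feedback signal is a raised exception. All names here (decompose_once, generate_code, run) are hypothetical stand-ins, not the cited frameworks' actual APIs.

```python
# A minimal sketch of the static interpreter loop, with stubs
# in place of real LLM calls. Every helper name is hypothetical.

def decompose_once(requirement: str) -> list[str]:
    # One-shot decomposition: the plan is fixed and never revisited.
    return [f"single step for: {requirement}"]

def generate_code(step: str, error: str | None = None) -> str:
    # Stub for an LLM writing code for one step (optionally given an error).
    return "result = 42"

def run(code: str) -> tuple[bool, object]:
    # "Success" is defined purely as "no exception was raised".
    env: dict = {}
    try:
        exec(code, env)
        return True, env.get("result")
    except Exception as e:
        return False, str(e)

def static_interpreter(requirement: str) -> list[object]:
    outputs = []
    for step in decompose_once(requirement):
        ok, result = run(generate_code(step))
        if not ok:  # the only feedback signal is an exception
            ok, result = run(generate_code(step, error=str(result)))
        # Any error-free output is trusted, even if logically wrong.
        outputs.append(result)
    return outputs

print(static_interpreter("train a churn model"))  # [42]
```

The loop trusts any exception-free output, which is precisely the logic-validation gap described in challenge 3 above.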


To address the aforementioned challenges, we introduce an LLM-based agent, called the Data Interpreter, designed specifically for the field of data science. This agent follows a plan-code-verify approach to fulfill human requirements by breaking down tasks, executing code, and verifying feedback. Specifically, we propose:

1) Dynamic planning with hierarchical structure: The Data Interpreter employs hierarchical graph structures to capture the inherent complexities of data science more effectively. A dynamic planning approach equips it to adapt to task variations, proving especially efficient at monitoring data changes and managing the intricate variable dependencies inherent in data science problems.

2) Tool utilization and generation: We enhance coding proficiency by integrating various human-authored code snippets and creating custom tools for specific tasks, going beyond mere API-focused capabilities. This process automatically combines diverse tools with self-generated code, using task-level execution to independently build and expand a tool library, simplify tool usage, and restructure code as needed.

3) Enhanced reasoning with logic-bug awareness: Reasoning is guided by a confidence score derived from execution results and test-driven validations, which are essential when execution raises no exceptions. The agent detects inconsistencies between a code solution and the execution of its test code, and compares multiple trials to reduce logic errors. Throughout execution and reasoning, task-level experiences, comprising metadata and runtime trajectories of both successes and failures, are recorded.
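The sketch below illustrates the first technique under our own simplifying assumptions: tasks are nodes in a dependency graph, a task becomes eligible to run only when its upstream dependencies have finished, and a failure triggers re-planning of the affected node rather than a full restart. The Task and TaskGraph classes and the execute/replan stubs are hypothetical illustrations, not the released MetaGPT code.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of dynamic planning over a hierarchical task graph.

@dataclass
class Task:
    task_id: str
    instruction: str
    dependencies: list[str] = field(default_factory=list)
    result: object = None
    finished: bool = False

class TaskGraph:
    def __init__(self) -> None:
        self.tasks: dict[str, Task] = {}

    def add(self, task: Task) -> None:
        self.tasks[task.task_id] = task

    def ready(self) -> list[Task]:
        # Tasks whose upstream dependencies have all finished.
        return [
            t for t in self.tasks.values()
            if not t.finished
            and all(self.tasks[d].finished for d in t.dependencies)
        ]

def execute(task: Task) -> bool:
    # Stub executor: a real agent would run LLM-generated code in a
    # sandbox and verify the output before marking the task finished.
    task.result = f"output of {task.task_id}"
    task.finished = True
    return True

def replan(graph: TaskGraph, failed: Task) -> None:
    # Stub re-planner: on failure or data drift, only the affected node
    # (or subgraph) is revised; finished upstream results are reused.
    failed.instruction += " (revised after execution feedback)"

graph = TaskGraph()
graph.add(Task("t1", "clean raw data"))
graph.add(Task("t2", "engineer features", dependencies=["t1"]))
graph.add(Task("t3", "train and evaluate model", dependencies=["t2"]))

while (pending := graph.ready()):
    for task in pending:
        if not execute(task):
            replan(graph, task)  # local adjustment instead of a restart

print([t.result for t in graph.tasks.values()])
```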
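For the third technique, here is a hedged sketch of how test-driven validation can catch an exception-free logic bug: each candidate solution is paired with shared test code, and the pass rate across independent trials serves as a crude confidence score. The exact scoring used by the Data Interpreter may differ; run_with_test and confidence are illustrative names.

```python
# Hypothetical sketch of test-driven validation catching a logic bug
# that raises no exception. Names are illustrative, not the actual API.

def run_with_test(solution: str, test: str) -> bool:
    env: dict = {}
    try:
        exec(solution, env)
        exec(test, env)      # assertions encode the task's intended logic
        return True
    except AssertionError:
        return False         # ran without errors, but logically wrong
    except Exception:
        return False         # ordinary runtime failure

def confidence(solutions: list[str], test: str) -> float:
    # Crude confidence score: pass rate across independent trials.
    passed = [run_with_test(s, test) for s in solutions]
    return sum(passed) / len(passed)

# Two candidate solutions for "compute the mean of xs"; the second has
# an off-by-one logic bug that executes without raising any exception.
good = "def mean(xs): return sum(xs) / len(xs)"
bad = "def mean(xs): return sum(xs) / (len(xs) + 1)"
test = "assert mean([2, 4, 6]) == 4"

print(confidence([good, bad], test))  # 0.5: disagreement flags a logic error
```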


As depicted in Figure 1, the Data Interpreter significantly surpasses existing open-source frameworks. Compared to these baselines, it delivers a 10.3% improvement in machine learning tasks (from 0.86 to 0.95) and a 26% gain on the MATH dataset, demonstrating robust problem-solving capabilities. On open-ended tasks, its performance more than doubles, a 112% increase, showcasing its efficacy across a wide spectrum of challenges.


We summarize our contributions as follows:


• We propose a dynamic planning framework with hierarchical structures, enhancing adaptability and problem-solving capabilities in data science tasks.


• We improve the proficiency and efficiency of coding in LLMs by introducing automated tool integration, covering both tool utilization and tool generation.


• We improve reasoning by integrating verification and experience, thereby enhancing the accuracy and efficiency of problem-solving.


• Our experiments demonstrate that the Data Interpreter outperforms existing methods on machine learning tasks, mathematical problems, and open-ended tasks, setting a new standard for performance.


This paper is available on arxiv under CC BY 4.0 DEED license.