Authors:
(1) Sirui Hong, DeepWisdom and these authors contributed equally to this work;
(2) Yizhang Lin, DeepWisdom and these authors contributed equally to this work;
(3) Bang Liu, Université de Montréal & Mila and these authors are listed in alphabetical order;
(4) Bangbang Liu, DeepWisdom and these authors contributed equally to this work;
(5) Binhao Wu, DeepWisdom and these authors contributed equally to this work;
(6) Danyang Li, DeepWisdom and these authors contributed equally to this work;
(7) Jiaqi Chen, Fudan University and these authors contributed equally to this work;
(8) Jiayi Zhang, Renmin University of China and these authors contributed equally to this work;
(9) Jinlin Wang, DeepWisdom and these authors contributed equally to this work;
(10) Li Zhang, Fudan University and these authors contributed equally to this work;
(11) Lingyao Zhang, these authors contributed equally to this work;
(12) Min Yang, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences and these authors contributed equally to this work;
(13) Mingchen Zhuge, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;
(14) Taicheng Guo, University of Notre Dame and these authors contributed equally to this work;
(15) Tuo Zhou, The University of Hong Kong and these authors contributed equally to this work;
(16) Wei Tao, Fudan University and these authors contributed equally to this work;
(17) Wenyi Wang, AI Initiative, King Abdullah University of Science and Technology and these authors contributed equally to this work;
(18) Xiangru Tang, Yale University and these authors contributed equally to this work;
(19) Xiangtao Lu, DeepWisdom and these authors contributed equally to this work;
(20) Xiawu Zheng, Xiamen University and these authors contributed equally to this work;
(21) Xinbing Liang, DeepWisdom, East China Normal University and these authors contributed equally to this work;
(22) Yaying Fei, Beijing University of Technology and these authors contributed equally to this work;
(23) Yuheng Cheng, The Chinese University of Hong Kong, Shenzhen and these authors contributed equally to this work;
(24) Zongze Xu, DeepWisdom, Hohai University and these authors contributed equally to this work;
(25) Chenglin Wu, DeepWisdom and corresponding author.
Editor's Note: This is Part 5 of 5 of a research study detailing the development of Data Interpreter, a solution for various data science and real-world tasks. Read the rest below.
3 Methodology and 3.1 Dynamic planning with Hierarchical Structure
A. Additional Results
B. Implementation Details
C. Details of Datasets
In this paper, we introduced the Data Interpreter, a solution for data science problem-solving built on dynamic planning with hierarchical graphs, tool integration and evolution, and automated confidence-based verification. The Data Interpreter is designed to address the intensive data dependencies, refined domain knowledge, and rigorous logical requirements inherent in data science, enhancing reliability, automation, and reasoning capability when managing sophisticated data science tasks. In extensive evaluations, the Data Interpreter outperformed various open-source frameworks on machine learning tasks, mathematical problems, and real-world tasks, marking a substantial advance in the capabilities of LLM-based agents for data science.
For a deeper understanding, Table 5 presents the results on the ML-Benchmark for both the completion rate (CR) and normalized performance score (NPS) metrics. Additionally, Table 6 reports the ablation experiments on the ML-Benchmark using the same two metrics.
We present the Data Interpreter's results on several open-ended tasks in two figures: tasks 8, 9, 10, and 13 in Figure 9, and tasks 4, 14, and 15 in Figure 10.
Figure 11 illustrates the Data Interpreter's data analysis and visualization results.
Here is an example of a plan. The user requirement is: “This is a dataset featuring sensor readings from industrial machines, aimed at predicting machine operational status (normal or faulty). Visualize the analysis and prediction results with high-quality graphs. Train data path: {train path}, eval data path: {eval path}.”
Figure 12 illustrates the structure of each task. Each task includes an instruction, a dependencies array, code, and a flag. The dependencies array and the flag maintain the node's dependencies and runtime status, while the instruction and the code describe the task in natural language and in code, respectively.
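To make this concrete, here is a minimal Python sketch of a plan for the requirement above. The class and field names (`Task`, `instruction`, `dependent_task_ids`, `code`, `is_finished`) are illustrative assumptions based on the description above, not the Data Interpreter's exact schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """One node in the plan graph; field names are illustrative assumptions."""
    task_id: str
    instruction: str                   # natural-language description of the task
    dependent_task_ids: List[str] = field(default_factory=list)  # the dependencies array
    code: str = ""                     # code realizing the instruction
    is_finished: bool = False          # the runtime-status flag

# A plausible decomposition of the sensor-readings requirement:
plan = [
    Task("1", "Load and explore the train and eval sensor datasets"),
    Task("2", "Preprocess the sensor readings (impute missing values, scale features)", ["1"]),
    Task("3", "Train a classifier to predict machine operational status", ["2"]),
    Task("4", "Visualize the analysis and prediction results with high-quality graphs", ["3"]),
]
```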
Because each task's code is executed automatically, it is fed directly into the same executor used by its predecessor task, which keeps code variables consistent across sequential tasks. During task execution, the execution outputs are stored as runtime results.
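This variable-sharing behavior can be approximated by running every task's code in a single shared namespace, as in the simplified sketch below. This only illustrates the idea; the actual executor is a stateful code interpreter rather than a bare `exec` loop.

```python
# Minimal sketch: run each task's code in one shared namespace so that
# variables created by a predecessor task remain visible to its successors.
shared_ns: dict = {}
runtime_results: dict = {}

def run_task(task_id: str, code: str) -> None:
    exec(code, shared_ns)                               # executes in the shared namespace
    runtime_results[task_id] = shared_ns.get("result")  # store this task's runtime result

run_task("1", "data = [3, 1, 2]\nresult = sorted(data)")
run_task("2", "result = sum(data)")  # task 2 reuses `data` defined by task 1
print(runtime_results)               # {'1': [1, 2, 3], '2': 6}
```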
The tools of our Data Interpreter are listed in Table 7.
B.3.1 AN EXAMPLE OF TOOL SCHEMA
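The schema itself appears as a figure in the original paper. As a stand-in, the sketch below expresses what such a schema might contain as a Python dict; the tool name, methods, and parameters are illustrative assumptions rather than the exact schema.

```python
# Illustrative tool schema (names, fields, and defaults are assumptions,
# not the Data Interpreter's exact schema).
tool_schema = {
    "name": "FillMissingValue",
    "description": "Impute missing values in specified columns of a DataFrame.",
    "methods": {
        "fit_transform": {
            "description": "Fit the imputer on the data, then transform it.",
            "parameters": {
                "df": {"type": "pd.DataFrame", "description": "Input data."},
                "features": {"type": "list[str]", "description": "Columns to impute."},
                "strategy": {"type": "str", "default": "mean",
                             "description": "'mean', 'median', or 'most_frequent'."},
            },
            "returns": {"type": "pd.DataFrame",
                        "description": "Data with missing values filled."},
        }
    },
}
```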
B.3.2 TOOL USAGE PROMPTS
We use two types of prompts for tool utilization: zero-shot prompts for open-ended tasks and one-shot prompts for machine learning tasks, as illustrated below.
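The verbatim prompts appear as figures in the original article; the sketch below only illustrates the zero-shot versus one-shot distinction, with assumed wording and a hypothetical import path.

```python
# Hedged sketch of the two prompt styles; the wording is assumed, not the paper's.
ZERO_SHOT_TOOL_PROMPT = """\
You have access to the following tools:
{tool_schemas}

Task: {instruction}
Write Python code that completes the task, calling the tools where appropriate."""

ONE_SHOT_TOOL_PROMPT = ZERO_SHOT_TOOL_PROMPT + """

Example:
Task: Fill missing values in the `age` column.
Code:
from tools import FillMissingValue  # hypothetical import path
df = FillMissingValue().fit_transform(df, features=["age"], strategy="mean")"""
```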
Figures 13 to 16 showcase several typical open-ended tasks. For each task, we include the necessary data, the user requirements, and the assessment pipeline.
Here are the details of the ML-Benchmark dataset. We collected several typical machine learning datasets from Kaggle[1]; details are given in Table 8.
This paper is available on arXiv under a CC BY 4.0 DEED license.