Table of Links
- Related Works
- Preliminary
- Q*: A General, Versatile and Agile Deliberation Framework for LLMs
- Experiments
- Conclusion and References
5 Experiments
5.1 Experimental Settings
Datasets. We evaluate the effectiveness of Q* on two math reasoning tasks and one code generation task, with dataset statistics summarized in Table 1. 1) GSM8K [2] is a dataset of grade school math problems, where the solution is given in a one-line-per-step format with an exact numerical answer in the last line; 2) MATH [3] is a dataset of problems from high school math competitions, where the solutions are given in a format that mixes LaTeX code and natural language; 3) MBPP [44] is an entry-level Python programming dataset, where the questions are coding challenges along with a test case that defines the function format. The solutions are Python code that is expected to pass the pre-collected test cases of each question.
Implementation Details. The implementation of the Q* method mainly includes three steps: 1) Q-value estimation. As discussed in Section 4.1, we propose several ways to estimate Q-values and illustrate the implementation details in the next paragraph. 2) Utility aggregation for each step (cf. Eq. (5)). For the GSM8K dataset, we adopt a process reward model (PRM) trained on PRM800K [22] to model R_P, providing an intermediate signal for each reasoning step, and use min as the aggregation function. For the MATH dataset, we set g(s_t) = 0 for all passed states {s_i}_{i=1}^t in each trajectory for fairness, because PRM800K contains data samples constructed from the MATH test set and there is a potential risk of data leakage. For the MBPP dataset, we tokenize the code generated so far with the function tokenize.generate_tokens and give a penalty of -0.5 if TokenError is raised, which typically indicates mismatched delimiters (e.g., parentheses, quotation marks) or invalid indentation in the code. We use [-1] (i.e., keeping only the last value) as the aggregation function to cancel previous penalties, since the code is generated on the fly and mismatched delimiters may be fixed in subsequent steps. 3) A* planning. For the GSM8K and MATH datasets, we treat a single line output by the LLM as an action, while an action in MBPP is defined as a 24-token code snippet during planning. In all experiments, we set λ = 1 when computing f-values and expand each state with K = 6 actions at each reasoning step. Finally, following the common practice of Best-of-N, we perform planning to collect N = 6 trajectories for each question and select the one with the maximum f-value as the final result for evaluation.
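To make the MBPP utility above concrete, the following minimal Python sketch computes the step-wise penalty with tokenize.generate_tokens; the helper name syntax_penalty and the example snippets are our own illustrative choices, not the authors' exact implementation.

```python
import io
import tokenize

def syntax_penalty(partial_code: str) -> float:
    """Return -0.5 if the code generated so far fails to tokenize, else 0.0."""
    try:
        # generate_tokens is lazy, so consume it for errors to surface.
        list(tokenize.generate_tokens(io.StringIO(partial_code).readline))
    except (tokenize.TokenError, IndentationError):
        # Mismatched delimiters (parentheses, quotation marks) or invalid
        # indentation leave the current prefix untokenizable.
        return -0.5
    return 0.0

# With the [-1] aggregation, only the penalty of the latest prefix counts,
# so an unmatched delimiter that is closed in a later step is forgiven:
print(syntax_penalty("def add(a, b):\n    return (a +"))      # -0.5
print(syntax_penalty("def add(a, b):\n    return (a + b)\n"))  # 0.0
```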
For Q-value estimation, in our practice we find learning from rollouts to be the most effective and robust way to collect precise Q-value labels. Specifically, given the prompt of a question, we first perform random rollouts with the LLM, denoted as π_θ, to obtain complete trajectories under a high-temperature setting (e.g., τ = 0.9 for math reasoning and τ = 0.2 for code generation), and split each trajectory into a series of states according to the newline token "\n". Then, for each state-action pair in a trajectory, denoted as (s_t, a_t), we perform random rollout or MCTS with the same LLM to collect a pool P of trajectories, and select the reasoning path with the highest accumulated reward to construct the Q-value label of the current state-action pair. Note that the reward R(s_t, a_t) is 1 only if the obtained numerical answer is correct or the program passes all test cases, so the Q-value of a state-action pair can be 1 only if it has the potential to generate a trajectory containing the correct answer.
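The rollout-based labeling above can be summarized in a short sketch. It assumes two hypothetical helpers, sample_continuation (which continues a partial trajectory with π_θ at the given temperature) and is_correct (which checks the final numerical answer or runs the test cases); the helper names and the rollout count are illustrative rather than the authors' exact code.

```python
def collect_q_label(question, state, action,
                    sample_continuation, is_correct,
                    num_rollouts=16, temperature=0.9):
    """Label Q(s_t, a_t) by random rollout from the state-action pair.

    A completed trajectory receives reward 1 if its final answer is correct
    (or all test cases pass) and 0 otherwise, so the label is 1 only when at
    least one rollout starting from (s_t, a_t) reaches a correct answer.
    """
    prefix = state + action
    best_reward = 0.0
    for _ in range(num_rollouts):
        completion = sample_continuation(question, prefix, temperature)
        reward = 1.0 if is_correct(question, prefix + completion) else 0.0
        best_reward = max(best_reward, reward)
    return best_reward
```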
5.2 Experimental Results
GSM8K. For the comparison on the GSM8K dataset, we select Llama-2-7b [45] as our base model, which achieves 65.2% accuracy after fine-tuning on MetaMath [5]. We then treat Llama-2-7b fine-tuned on MetaMath as the policy π_θ and perform random rollouts to collect Q-value labels for training the Q-value model (QVM). For utility aggregation, we train a process reward model (PRM) on PRM800K [22] to provide an intermediate signal for each reasoning step. With the PRM and QVM in hand, traditional methods either treat one of them as a verifier to select the Best-of-N trajectory or use them for PPO training in RLHF. As shown in Table 2, with the same PRM/QVM, using it for verification performs significantly better than using it for alignment. Further, among planning-based methods, with the same QVM, the Q* method with a constant aggregated utility still outperforms the Best-of-N method. With the PRM trained on PRM800K judging whether the intermediate reasoning steps are correct, the Q* method that combines the PRM and QVM achieves the best performance among all methods based on the same LLM, helping Llama-2-7b surpass closed-source ChatGPT-turbo [46] and reach an accuracy of 80.8%.
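For reference, using the PRM or QVM as a verifier reduces to Best-of-N selection, as in the following minimal sketch; the generate callable (sampling one full solution from the policy) and the score callable (backed by the verifier) are illustrative names, not the authors' API.

```python
def best_of_n(question, generate, score, n=6):
    """Sample n candidate solutions and keep the one the verifier scores highest."""
    candidates = [generate(question) for _ in range(n)]
    return max(candidates, key=lambda solution: score(question, solution))
```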
MATH. As shown in Table 3, given the weak performance of Llama-2-7b fine-tuned on MetaMath on the MATH dataset, we adopt two stronger LLMs to evaluate the effectiveness of our Q* method. One is Llama-2-7b fine-tuned on synthetic data [47], constructed by scaling up the SFT data, which achieves 41.9% accuracy on the MATH dataset and approaches the performance of GPT-4 [48]. The other base model is DeepSeekMath-7b [49], arguably the most powerful open-source 7b model for math reasoning on the MATH dataset, achieving 50.8% accuracy in our evaluation. From the results shown in the second and third blocks of Table 3, Q* still leads to further performance improvements over the Best-of-N method on both base models. Notably, DeepSeekMath-7b enhanced with Q* surpasses a series of closed-source models on the MATH leaderboard [2], such as Gemini Ultra (4-shot), reaching an accuracy of 55.4%.
MBPP. For the comparison on the MBPP dataset, we choose the most powerful open-source LLM for code generation, CodeQwen1.5-7b-Chat, as our base model for evaluating the effectiveness of Q*. Following a procedure similar to that for math reasoning, we train a QVM for Q-value estimation and manually construct the utility function as described in the implementation details above (tokenizing the code generated so far with tokenize.generate_tokens and giving a penalty of -0.5 if TokenError is raised). As shown in Table 4, Q* still outperforms the Best-of-N method on code generation and helps CodeQwen1.5-7b-Chat achieve 77.0% accuracy on the MBPP dataset, which is also a promising result on the MBPP leaderboard [3].
6 Conclusion
Solving challenging multi-step reasoning problems requires LLMs to perform in-depth deliberation beyond auto-regressive token generation. In this paper, we present Q*, a general, versatile and agile deliberation framework for LLMs. Unlike existing deliberation methods, which require extensive expertise to design a utility function for each specific task, Q* relies solely on ground truth to train the value model and can be easily applied to various reasoning tasks without modification. Moreover, by leveraging plug-and-play Q-value models as the heuristic function, Q* can effectively guide LLMs to solve various tasks without fine-tuning the LLMs beforehand, which avoids potential performance degradation on other tasks. Finally, Q* is agile because it considers only a single step at a time rather than complete rollouts (e.g., the simulation in MCTS). Extensive empirical evaluations on math reasoning and code generation tasks confirm the superiority of our method.
References
[1] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. arXiv preprint arXiv:2402.00157, 2024.
[2] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[3] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021.
[4] Peiyi Wang, Lei Li, Zhihong Shao, RX Xu, Damai Dai, Yifei Li, Deli Chen, Y Wu, and Zhifang Sui. Math-shepherd: A label-free step-by-step verifier for LLMs in mathematical reasoning. arXiv preprint arXiv:2312.08935, 2023.
[5] Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
[6] Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. arXiv preprint arXiv:2308.09583, 2023.
[7] Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568, 2023.
[8] Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
[9] CodeGemma Team, Ale Jakse Hartman, Andrea Hu, Christopher A. Choquette-Choo, Heri Zhao, Jane Fine, Jeffrey Hui, Jingyue Shen, et al. Codegemma: Open code models based on gemma. 2024. URL https://goo.gle/codegemma.
[10] Jian Xie, Kai Zhang, Jiangjie Chen, Tinghui Zhu, Renze Lou, Yuandong Tian, Yanghua Xiao, and Yu Su. TravelPlanner: A benchmark for real-world planning with language agents. arXiv preprint arXiv:2402.01622, 2024.
[11] Bo Liu, Yuqian Jiang, Xiaohan Zhang, Qiang Liu, Shiqi Zhang, Joydeep Biswas, and Peter Stone. LLM+P: Empowering large language models with optimal planning proficiency. arXiv preprint arXiv:2304.11477, 2023.
[12] Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning. In NeurIPS, pages 79081–79094, 2023.
[13] Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and Subbarao Kambhampati. On the planning abilities of large language models - A critical investigation. In NeurIPS, pages 75993–76005, 2023.
[14] Kaya Stechly, Karthik Valmeekam, and Subbarao Kambhampati. Chain of thoughtlessness: An analysis of CoT in planning. arXiv preprint arXiv:2405.04776, 2024.
[15] Kahneman Daniel. Thinking, Fast and Slow. Macmillan, 2011.
[16] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, pages 24824–24837, 2022.
[17] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
[18] Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. Complexity-based prompting for multi-step reasoning. In ICLR, 2022.
[19] Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
[20] Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023.
[21] Xiang Yue, Xingwei Qu, Ge Zhang, Yao Fu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653, 2023.
[22] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. arXiv preprint arXiv:2305.20050, 2023.
[23] Jonathan Uesato, Nate Kushman, Ramana Kumar, Francis Song, Noah Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. arXiv preprint arXiv:2211.14275, 2022.
[24] Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. Discriminator-guided multi-step reasoning with language models. arXiv preprint arXiv:2305.14934, 2023.
[25] Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023.
[26] Xidong Feng, Ziyu Wan, Muning Wen, Ying Wen, Weinan Zhang, and Jun Wang. Alphazero-like tree-search can guide large language model decoding and training. arXiv preprint arXiv:2309.17179, 2023.
[27] Shibo Hao, Yi Gu, Haodi Ma, Joshua Jiahua Hong, Zhen Wang, Daisy Zhe Wang, and Zhiting Hu. Reasoning with language model is planning with world model. arXiv preprint arXiv:2305.14992, 2023.
[28] Yuchen Zhuang, Xiang Chen, Tong Yu, Saayan Mitra, Victor Bursztyn, Ryan A Rossi, Somdeb Sarkhel, and Chao Zhang. Toolchain*: Efficient action space navigation in large language models with A* search. arXiv preprint arXiv:2310.13227, 2023.
[29] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.
[30] Peter E Hart, Nils J Nilsson, and Bertram Raphael. A formal basis for the heuristic determination of minimum cost paths. IEEE Transactions on Systems Science and Cybernetics, 4(2):100–107, 1968.
[31] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. In NeurIPS, pages 27730–27744, 2022.
[32] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
[33] Jiaming Ji, Boyuan Chen, Hantao Lou, Donghai Hong, Borong Zhang, Xuehai Pan, Juntao Dai, and Yaodong Yang. Aligner: Achieving efficient alignment through weak-to-strong correction. arXiv preprint arXiv:2402.02416, 2024.
[34] Rishi Hazra, Pedro Zuidberg Dos Martires, and Luc De Raedt. SayCanPay: Heuristic planning with large language models using learnable domain knowledge. In AAAI, pages 20123–20133, 2024.
[35] Dong Huang, Qingwen Bu, Jie M Zhang, Michael Luck, and Heming Cui. Agentcoder: Multi-agent-based code generation with iterative testing and optimisation. arXiv preprint arXiv:2312.13010, 2023.
[36] Noah Shinn, Federico Cassano, Beck Labash, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366, 2023.
[37] Qwen Team. Code with CodeQwen1.5, April 2024. URL https://qwenlm.github.io/blog/codeqwen1.5/.
[38] Blai Bonet and Héctor Geffner. Planning as heuristic search. Artificial Intelligence, 129(1-2): 5–33, 2001.
[39] David Silver. Cooperative pathfinding. In AAAI-AIIDE, pages 117–122, 2005.
[40] Bobak Pezeshki, Radu Marinescu, Alexander Ihler, and Rina Dechter. AND/OR branch-and-bound for computational protein design optimizing K*. In UAI, pages 1602–1612, 2022.
[41] Stuart J Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2016.
[42] Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. arXiv preprint arXiv:2306.01693, 2023.
[43] Martin Riedmiller. Neural fitted Q iteration–first experiences with a data efficient neural reinforcement learning method. In ECML, pages 317–328, 2005.
[44] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[45] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[46] Kumar Shridhar, Koustuv Sinha, Andrew Cohen, Tianlu Wang, Ping Yu, Ram Pasunuru, Mrinmaya Sachan, Jason Weston, and Asli Celikyilmaz. The art of LLM refinement: Ask, refine, and trust. arXiv preprint arXiv:2311.07961, 2023.
[47] Chen Li, Weiqi Wang, Jingcheng Hu, Yixuan Wei, Nanning Zheng, Han Hu, Zheng Zhang, and Houwen Peng. Common 7b language models already possess strong math capabilities. arXiv preprint arXiv:2403.04706, 2024.
[48] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
[49] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, YK Li, Y Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
[50] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[51] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
[52] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023.
Authors:
(1) Chaojie Wang*, Skywork AI;
(2) Yanchen Deng*, Nanyang Technological University;
(3) Zhiyi Lyu, Nanyang Technological University;
(4) Liang Zeng, Skywork AI;
(5) Jujie He, Skywork AI.
[2] https://paperswithcode.com/sota/math-word-problem-solving-on-math
[3] https://paperswithcode.com/sota/code-generation-on-mbpp