
Using LLMs to Correct Reasoning Mistakes: Related Works That You Should Know About


Authors:

(1) Gladys Tyen, University of Cambridge, Dept. of Computer Science & Technology, ALTA Institute; work done during an internship at Google Research (e-mail: [email protected]);

(2) Hassan Mansoor, Google Research (e-mail: [email protected]);

(3) Victor Carbune, Google Research (e-mail: [email protected]);

(4) Peter Chen, Google Research; equal leadership contribution (e-mail: [email protected]);

(5) Tony Mak, Google Research; equal leadership contribution (e-mail: [email protected]).

Abstract and Introduction

BIG-Bench Mistake

Benchmark results

Backtracking

Related Works

Conclusion, Limitations, and References

A. Implementational details

B. Annotation

C. Benchmark scores

Datasets To our knowledge, the only publicly available dataset containing mistake annotations in LLM outputs is PRM800K (Lightman et al., 2023), which is a dataset of solutions to Olympiad-level math questions. Our dataset BIG-Bench Mistake covers a wider range of tasks to explore the reasoning capabilities of LLMs more thoroughly.


Additionally, the generator LLM used in PRM800K has been fine-tuned on 1.5B math tokens as well as a dataset of step-by-step math solutions. For this paper, we wanted to explore few-shot in-context learning methods, which are typically used in real-world applications with API-based LLMs.


Self-correction Pan et al. (2023) survey a wide range of self-correction methods from the recent literature. While their list includes training-time correction strategies such as RLHF (Ouyang et al., 2022) and self-improve (Huang et al., 2022), our backtracking method falls into the category of post-hoc correction, where the correction process is applied to outputs that have already been generated.


Our paper focuses on the correction of logical and reasoning errors, rather than stylistic or qualitative improvements. Previous post-hoc correction methods that target reasoning errors include Reflexion (Shinn et al., 2023) and RCI (Kim et al., 2023), both of which cause performance deterioration when the oracle label is not used (Huang et al., 2023). Other methods such as Self-Refine (Madaan et al., 2023) and iterative refinement (Chen et al., 2023) focus on qualitative or stylistic improvements rather than correcting logical errors.


This paper is available on arXiv under a CC 4.0 license.