Understanding the Impact of Deductive Verification on Final Answer Accuracy



Too Long; Didn't Read

Improvements in deductive verification accuracy significantly enhance reasoning reliability but don’t always boost final answer correctness. This is because verification removes many reasoning chains that reach the correct answer, most of which exhibit incorrect reasoning, leaving fewer votes for the correct answer under majority voting. Future work aims to filter out more of the incorrect reasoning chains that reach incorrect answers, in order to improve final answer accuracy.

Authors:

(1) Zhan Ling, UC San Diego (equal contribution);

(2) Yunhao Fang, UC San Diego (equal contribution);

(3) Xuanlin Li, UC San Diego;

(4) Zhiao Huang, UC San Diego;

(5) Mingu Lee, Qualcomm AI Research;

(6) Roland Memisevic, Qualcomm AI Research;

(7) Hao Su, UC San Diego.

Abstract and Introduction

Related work

Motivation and Problem Formulation

Deductively Verifiable Chain-of-Thought Reasoning

Experiments

Limitations

Conclusion, Acknowledgements and References


A Deductive Verification with Vicuna Models

B More Discussion on Improvements of Deductive Verification Accuracy Versus Improvements on Final Answer Correctness

C More Details on Answer Extraction

D Prompts

E More Deductive Verification Examples

B More Discussion on Improvements of Deductive Verification Accuracy Versus Improvements on Final Answer Correctness

In the main paper, we demonstrated that our verification approach significantly improves the verification accuracy of reasoning chains (Tab. 3, 6), but barely improves the final answer accuracy (Tab. 4). We further analyze this phenomenon below:


Table 10: Hyperparameters for finetuning Vicuna models with our deductive verification dataset.


Consider the GSM8K dataset as an example (recall that the final answer for a problem is obtained through majority voting). For 91.6% of problems, |(number of votes received by the correct answer) − (largest number of votes received by a single wrong answer)| > 2, so their final answers are unlikely to be changed by our deductive verification approach. For the remaining 8.4% of problems, where deductive verification is more likely to affect the final answer, we found that:


• Among all reasoning chains that arrive at correct answers (these correct-answer chains account for 49.4% of all reasoning chain candidates), 46.2% of reasoning chains are filtered out by our verification process.


• Among the reasoning chains that arrive at the correct answer but are filtered out by our verification process, 76.3% indeed exhibit incorrect reasoning.


• Among the reasoning chains that arrive at the correct answer and are not filtered out by our verification process, 78.0% indeed have correct reasoning.


• Among the reasoning chains that do not arrive at the correct answer and exhibit incorrect reasoning (these account for 50.6% of all reasoning chain candidates), 40.6% are filtered out by our verification process.
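The vote-margin criterion used above to separate the 91.6% of problems from the remaining 8.4% can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names `vote_margin` and `is_stable`, the default threshold argument, and the sample answer lists are all assumptions made for the example.

```python
from collections import Counter


def vote_margin(answers, correct):
    """Absolute gap between the votes for the correct answer and the
    votes for the single most popular wrong answer."""
    counts = Counter(answers)
    correct_votes = counts.get(correct, 0)
    # Largest vote count among wrong answers (0 if there are none).
    wrong_votes = max((n for a, n in counts.items() if a != correct), default=0)
    return abs(correct_votes - wrong_votes)


def is_stable(answers, correct, threshold=2):
    """True when the margin exceeds the threshold, i.e. filtering a few
    chains is unlikely to flip the majority-vote outcome."""
    return vote_margin(answers, correct) > threshold


# Hypothetical sampled final answers for two problems whose correct answer is "18".
clear_case = ["18", "18", "18", "20", "18", "20", "18", "18", "17", "18"]
close_case = ["18", "20", "18", "20", "20"]

print(vote_margin(clear_case, "18"), is_stable(clear_case, "18"))
print(vote_margin(close_case, "18"), is_stable(close_case, "18"))
```

In the first list the correct answer leads by five votes, so removing a handful of chains would not change the result; in the second the margin is one vote, which is the kind of case where verification can change the final answer.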


The above statistics show that a significant portion of reasoning chains that arrive at correct answers but exhibit incorrect reasoning are successfully eliminated. Therefore, the reliability and trustworthiness of reasoning chains that arrive at the correct answers are significantly improved. Combined with the fact that a significant proportion of reasoning chains that arrive at incorrect answers are eliminated, and that our approach’s verification accuracy significantly improves over naive verification approaches, our primary goal of improving LLM reasoning reliability is accomplished.
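To make the reliability gain concrete, the reported percentages can be combined into a rough before/after comparison for correct-answer chains. The calculation below is a derived estimate, not a figure reported by the authors, and the variable names are chosen for this sketch only.

```python
# Reported statistics for correct-answer chains (on the 8.4% of problems analyzed).
filtered_share = 0.462          # share of correct-answer chains removed by verification
filtered_bad_reasoning = 0.763  # of the removed chains, share with incorrect reasoning
kept_good_reasoning = 0.780     # of the kept chains, share with correct reasoning

kept_share = 1.0 - filtered_share

# Share of correct-answer chains with correct reasoning before any filtering:
# (removed chains with good reasoning) + (kept chains with good reasoning).
good_before = (filtered_share * (1.0 - filtered_bad_reasoning)
               + kept_share * kept_good_reasoning)

# After filtering, only the kept chains remain.
good_after = kept_good_reasoning

print(f"correct reasoning before filtering: {good_before:.1%}")
print(f"correct reasoning after filtering:  {good_after:.1%}")
```

Under these numbers, roughly 53% of correct-answer chains have correct reasoning before verification, compared with 78.0% among those that survive it.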


Nevertheless, the removal of many reasoning chains that yield correct answers (46.2% × 49.4% ≈ 22.8% of all chains) has a notable impact. This even exceeds the removal of reasoning chains with incorrect reasoning and incorrect answers (40.6% × 50.6% ≈ 20.5% of all chains). As a result, there are fewer votes for the correct answer when generating final answers through majority voting, which limits the final answer accuracy. We believe that filtering out a greater proportion of incorrect reasoning chains with incorrect answers in the future will improve the final answer accuracy.
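The effect on majority voting can be checked with a short calculation based on the percentages reported above. The share of correct-answer chains among the surviving candidates is a derived quantity that the authors do not report, and the variable names are assumptions of this sketch.

```python
# Reported statistics (on the 8.4% of problems analyzed).
correct_share = 0.494       # chains that arrive at the correct answer
correct_filtered = 0.462    # fraction of those removed by verification
incorrect_share = 0.506     # chains with incorrect answer and incorrect reasoning
incorrect_filtered = 0.406  # fraction of those removed by verification

# Shares of ALL chains that are removed from each group.
removed_correct = correct_share * correct_filtered
removed_incorrect = incorrect_share * incorrect_filtered

# Shares of ALL chains that survive verification in each group.
kept_correct = correct_share * (1.0 - correct_filtered)
kept_incorrect = incorrect_share * (1.0 - incorrect_filtered)

# Fraction of surviving chains that vote for the correct answer.
correct_share_after = kept_correct / (kept_correct + kept_incorrect)

print(f"removed, correct answer:   {removed_correct:.1%}")
print(f"removed, incorrect answer: {removed_incorrect:.1%}")
print(f"correct-answer share before: {correct_share:.1%}")
print(f"correct-answer share after:  {correct_share_after:.1%}")
```

Because slightly more correct-answer chains than incorrect-answer chains are removed, the correct answer's share of the remaining votes falls a little (from 49.4% to about 47%) instead of rising, which is consistent with the limited gain in final answer accuracy.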


This paper is available on arXiv under the CC BY 4.0 DEED license.