paint-brush
How Fine-Tuning Impacts Deductive Verification in Vicuna Modelsby@cosmological
New Story

How Fine-Tuning Impacts Deductive Verification in Vicuna Models

tldt arrow

Too Long; Didn't Read

Fine-tuning Vicuna models with a custom dataset enhances their deductive verification accuracy, particularly for GSM8K and other datasets. However, they still fall short compared to non-finetuned GPT-3.5 models, highlighting the influence of model capacity on verification performance.
featured image - How Fine-Tuning Impacts Deductive Verification in Vicuna Models
Cosmological thinking: time, space and universal causation  HackerNoon profile picture

Authors:

(1) Zhan Ling, UC San Diego and equal contribution;

(2) Yunhao Fang, UC San Diego and equal contribution;

(3) Xuanlin Li, UC San Diego;

(4) Zhiao Huang, UC San Diego;

(5) Mingu Lee, Qualcomm AI Research and Qualcomm AI Research

(6) Roland Memisevic, Qualcomm AI Research;

(7) Hao Su, UC San Diego.

Abstract and Introduction

Related work

Motivation and Problem Formulation

Deductively Verifiable Chain-of-Thought Reasoning

Experiments

Limitations

Conclusion, Acknowledgements and References


A Deductive Verification with Vicuna Models

B More Discussion on Improvements of Deductive Verification Accuracy Versus Improvements on Final Answer Correctness

C More Details on Answer Extraction

D Prompts

E More Deductive Verification Examples

A Deductive Verification with Vicuna Models

We further explore the efficacy of deductive verification for open-source models. We select two popular models: Vicuna-7B and Vicuna-13B [7]. These models are fine-tuned versions of LLaMA-7B and LLaMA-13B [47] using the ShareGPT data[3]. We use the same Natural Program-based one-shot verification method we used in the main paper. Results are shown in the first and the third rows of Table 9. We observe for the original Vicuna models without finetuning, Vicuna-7B exhibits poor performance in deductive verification and fails to find out reasoning mistakes, while the larger Vicuna-13B exhibits better verification accuracy.


Table 9: One-shot Deductive Verification Accuracy of Vicuna-7B and Vicuna-13B. The models are evaluated with or without finetuning on our deductive verification dataset. For each dataset, we randomly sample 50 reasoning chains that are deductively valid and 50 reasoning steps exhibiting incorrect reasonings.


We therefore conduct an additional experiment to investigate if the verification accuracy of Vicuna models can be improved by fine-tuning. To this end, we generate a deductive verification dataset, which consists of 2000 reasoning steps evenly distributed between correct and incorrect categories. We automatically generate this dataset using GPT-3.5-turbo since it exhibits a very high accuracy of single-step verification. We first use GPT-3.5-turbo to generate solutions for problems in GSM8K’s training set. We then execute step-by-step deductive verification on these solutions using GPT-3.5- turbo. For solutions that result in correct final answers, we retain the reasoning steps that pass deductive verification. For solutions that yield incorrect final answers, we retain the reasoning steps that cannot pass deductive verification. After constructing our dataset, we then fine-tune the Vicuna models using the verifications of the 2000 reasoning steps. Models were fine-tuned with 4 A100-80GB over 3 epochs. Training parameters are shown in Table 10.


As shown in Tab. 9, we observe that fine-tuning with our dataset can enhance the deductive verification accuracy of Vicuna models not only on the dataset where the training dataset is constructed (GSM8K), but also on many other datasets. However, the accuracy is still worse than non-finetuned GPT-3.5, which suggests that model capacity has a significant impact on deductive verification capabilities.


This paper is available on arxiv under CC BY 4.0 DEED license.


[3] https://github.com/domeccleston/sharegpt