Authors:
(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);
(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);
(3) Nicholas D. Lane, University of Cambridge and Flower Labs;
(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;
(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.
Tasks and Datasets We conduct our experiments on the question-answering (QA) task, with data sampled from the following two datasets. 1) PMC-LLaMA (Wu et al., 2023) contributes a large-scale, comprehensive dataset for instruction tuning. This dataset encompasses medical question answering, rationales for reasoning, and conversational dialogues, comprising a total of 202M tokens. 2) Medalpaca-flashcards (Han et al., 2023) presents a dataset of over 514k entries, specifically crafted for fine-tuning LLMs for medical applications.
Models Our pre-trained base model is Llama 2-7B (Touvron et al., 2023b), which has been shown to perform well across many domains. We adopt the commonly used parameter-efficient fine-tuning technique Low-Rank Adaptation (LoRA; Hu et al., 2021), which freezes the pre-trained model weights and injects trainable rank-decomposition matrices into each layer of the Transformer architecture, enabling fine-tuning without incurring any additional inference latency.
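As a concrete illustration, the following is a minimal sketch of this setup using the Hugging Face PEFT library; the model identifier and the LoRA hyperparameters (rank, scaling factor, target modules) are placeholder choices, not the paper's exact configuration.

```python
# Minimal LoRA setup with Hugging Face PEFT (illustrative; the rank, alpha and
# target modules below are placeholder values, not the paper's configuration).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],   # inject adapters into attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)  # base weights remain frozen
model.print_trainable_parameters()               # only the LoRA matrices are trainable
```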
Evaluation metrics In this study, we adopt three evaluation metrics across five benchmarks commonly used for contemporary medical LLMs (Singhal et al., 2023; Wu et al., 2023; Lee et al., 2020; Luo et al., 2023). The evaluation focuses on two main aspects: 1) question-answering capability, assessed by GPT-4 (OpenAI et al., 2023) and OpenAssistant (Köpf et al., 2023) scoring on the Medalpaca-flashcards test set. The former metric gauges the consistency of the generated responses with the ground truth, while the latter approximates a human reward score for the answers. 2) Knowledge-acquisition ability, measured by the average accuracy on multiple-choice questions from the MMLU (Hendrycks et al., 2021a), MedMCQA (Pal et al., 2022), PubMedQA (Jin et al., 2019), and USMLE (Jin et al., 2021) datasets.
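To illustrate the second aspect, the sketch below shows one way the average multiple-choice accuracy could be computed; the record format and the `score_option` helper (e.g., the log-likelihood of an answer option under the model) are assumptions for illustration, not the paper's actual evaluation harness.

```python
# Hypothetical sketch: average multiple-choice accuracy across benchmarks.
# `score_option(model, question, option)` is assumed to return a scalar score
# (e.g., an answer log-likelihood); the example record format is also assumed.
def evaluate_benchmark(model, examples, score_option):
    correct = 0
    for ex in examples:  # ex: {"question": str, "options": [str, ...], "answer": int}
        scores = [score_option(model, ex["question"], opt) for opt in ex["options"]]
        pred = max(range(len(scores)), key=lambda i: scores[i])  # highest-scoring option
        correct += int(pred == ex["answer"])
    return correct / len(examples)

def average_accuracy(model, benchmarks, score_option):
    # benchmarks: {"MMLU": [...], "MedMCQA": [...], "PubMedQA": [...], "USMLE": [...]}
    accs = {name: evaluate_benchmark(model, exs, score_option) for name, exs in benchmarks.items()}
    return sum(accs.values()) / len(accs), accs
```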
Based on the low-quality dataset setup described in Section 2, we build our data-quality control pipeline in the centralized, federated IID, and federated non-IID settings (described in Appendix C.1). We implement both probability-based and gradient-based scoring methods to compute a score for each training sample, and set a unified scoring standard by applying the corresponding scoring functions to anchor data. For comparison, we add an Oracle baseline that trains only on high-quality data. We also consider commonly used in-context learning (ICL) as another scoring method (see Appendix C.4).
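For intuition, the following is a minimal sketch of what a probability-based score with an anchor-derived unified threshold could look like; the per-sample score (average token negative log-likelihood of the response given the instruction), the percentile rule, and the function names are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of probability-based data scoring with an anchor-set
# threshold. The NLL score and the percentile rule below are illustrative
# assumptions, not the paper's exact pipeline.
import torch

def response_nll(model, tokenizer, instruction, response):
    """Average negative log-likelihood of `response` tokens given `instruction`."""
    prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
    full_ids = tokenizer(instruction + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # mask out instruction tokens in the loss
    with torch.no_grad():
        loss = model(full_ids, labels=labels).loss
    return loss.item()

def unified_threshold(model, tokenizer, anchor_set, percentile=0.9):
    """Derive a single quality threshold for all clients from a shared anchor set."""
    anchor_scores = sorted(
        response_nll(model, tokenizer, ex["instruction"], ex["response"]) for ex in anchor_set
    )
    return anchor_scores[int(percentile * (len(anchor_scores) - 1))]

def filter_client_data(model, tokenizer, client_data, threshold):
    """Keep only samples whose score stays below the anchor-derived threshold."""
    return [
        ex for ex in client_data
        if response_nll(model, tokenizer, ex["instruction"], ex["response"]) <= threshold
    ]
```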
The experimental results are presented in Table 1. Our findings are as follows. 1) With our data-quality control pipeline, models trained with our data scoring techniques consistently outperform models trained on the low-quality datasets in both centralized and federated settings. 2) When employing ConPro and ICL as data scoring methods to select high-quality data, the performance of the global model can surpass that of models trained on the Oracle set. 3) In the non-IID setting, the global model trained with our quality-controlled data demonstrates strong performance. These results highlight not only the efficacy of the data scoring methods but also the effectiveness of a unified quality threshold determined by anchor data (for more details, see Appendix D).
This paper is available on arxiv under CC BY 4.0 DEED license.