Authors:
(1) Wanru Zhao, University of Cambridge and Shanghai AI Laboratory (equal contribution);
(2) Yaxin Du, Shanghai Jiao Tong University (equal contribution);
(3) Nicholas D. Lane, University of Cambridge and Flower Labs;
(4) Siheng Chen, Shanghai AI Laboratory and Shanghai Jiao Tong University;
(5) Yanfeng Wang, Shanghai AI Laboratory and Shanghai Jiao Tong University.
Training setups We use Hugging Face’s transformers library (Wolf et al., 2020) and the PEFT library (Mangrulkar et al., 2022) for loading pre-trained models and LoRA configurations, with PyTorch (Paszke et al., 2019) as the underlying auto-differentiation framework for training and evaluation. We use the AdamW optimizer (Loshchilov & Hutter, 2019; Kingma & Ba, 2015) for all experiments. All experiments are conducted on machines with the same hardware configuration, using NVIDIA A40 GPUs. In all experiments, we use 8-bit quantization, a batch size of 16, and a LoRA rank of 16. For the federated setting, we run 300 communication rounds with n = 20 clients and 10 local steps of model training per round (equivalent to 3 epochs of local training).
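A minimal sketch of this configuration is shown below. The base model name, LoRA alpha/dropout, and learning rate are illustrative placeholders, not values stated in the paper; only the 8-bit quantization, LoRA rank 16, and AdamW choice come from the setup above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Illustrative base checkpoint; the paper does not pin the model name here.
base_model = "meta-llama/Llama-2-7b-hf"

# 8-bit quantization, as stated in the setup.
quant_config = BitsAndBytesConfig(load_in_8bit=True)

tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=quant_config,
    device_map="auto",
)

# LoRA rank 16 as stated; alpha and dropout are assumed defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# AdamW over the trainable (LoRA) parameters; the learning rate is a placeholder.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-4
)
```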
Training data We use 16k samples in total, with 8k samples randomly drawn from each of PMC-LLaMA (Wu et al., 2023) and Medalpaca-flashcards (Han et al., 2023). In the low-quality data synthesis process, 3.2k samples (40% of the total data) are polluted with cut (10% of the total data), delete (15% of the total data), or exchange (15% of the total data) operations. These low-quality samples, together with the remaining high-quality samples, compose the low-quality dataset, while the oracle dataset contains only the remaining high-quality samples.
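A rough sketch of this pollution step follows. The concrete corruption logic (how much of a response is cut, deleting the full response, swapping responses across samples) and the field name "response" are our assumptions; only the cut/delete/exchange split and the overall fractions come from the paper.

```python
import random

def cut(response: str, keep_ratio: float = 0.5) -> str:
    """Truncate the response, keeping only a leading fraction of it (assumed)."""
    return response[: max(1, int(len(response) * keep_ratio))]

def delete(response: str) -> str:
    """Drop the response content entirely (assumed interpretation of 'delete')."""
    return ""

def exchange(responses: list[str], i: int) -> str:
    """Replace sample i's response with one drawn from a different sample."""
    j = random.choice([k for k in range(len(responses)) if k != i])
    return responses[j]

def pollute(dataset: list[dict], seed: int = 0) -> list[dict]:
    """Corrupt a fraction of the samples: 10% cut, 15% delete, 15% exchange."""
    random.seed(seed)
    n = len(dataset)
    idx = list(range(n))
    random.shuffle(idx)
    n_cut, n_del, n_exc = int(0.10 * n), int(0.15 * n), int(0.15 * n)
    responses = [d["response"] for d in dataset]
    out = [dict(d) for d in dataset]
    for i in idx[:n_cut]:
        out[i]["response"] = cut(responses[i])
    for i in idx[n_cut : n_cut + n_del]:
        out[i]["response"] = delete(responses[i])
    for i in idx[n_cut + n_del : n_cut + n_del + n_exc]:
        out[i]["response"] = exchange(responses, i)
    return out
```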
GPT-4 scoring 200 samples are randomly selected from the Medalpaca-flashcards dataset to serve as the test set. Each model under comparison generates responses on this test set, and we then use the OpenAI GPT-4 API to score these responses. Each response is rated by the judge on a scale from 0 to 1, reflecting how well the answer aligns with the ground truth.
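A minimal judging sketch with the OpenAI Python client is shown below. The judge prompt wording is our assumption; the paper does not reproduce its exact instructions, only the 0-to-1 scale.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judge instructions; only the 0-1 scale comes from the paper.
JUDGE_PROMPT = (
    "You are grading a medical question-answering model. Given the question, "
    "the ground-truth answer, and the model's answer, output a single number "
    "between 0 and 1 indicating how well the model's answer aligns with the "
    "ground truth. Output only the number."
)

def gpt4_score(question: str, reference: str, answer: str) -> float:
    """Ask GPT-4 to rate one model answer against the ground truth."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": f"Question: {question}\n"
                           f"Ground truth: {reference}\n"
                           f"Model answer: {answer}",
            },
        ],
    )
    return float(response.choices[0].message.content.strip())
```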
OpenAssistant OpenAssistant (Köpf et al., 2023) is a reward model built on the DeBERTa architecture (He et al., 2020), trained with human feedback data from Nakano et al. (2021), Stiennon et al. (2020), and Bai et al. (2022). For this metric, 200 samples are randomly selected from the Medalpaca-flashcards dataset as the test set, and each model response is assigned a score from -5 to 5 reflecting its quality.
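Scoring with the publicly released OpenAssistant DeBERTa reward model could look roughly like the sketch below. The checkpoint name is an assumption (the paper does not pin the exact release); the (question, answer) pair encoding follows that checkpoint's documented usage.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint; the paper does not name the exact reward-model release.
ckpt = "OpenAssistant/reward-model-deberta-v3-large-v2"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
reward_model = AutoModelForSequenceClassification.from_pretrained(ckpt).eval()

@torch.no_grad()
def reward_score(question: str, answer: str) -> float:
    """Return the scalar reward for a (question, answer) pair."""
    inputs = tokenizer(question, answer, return_tensors="pt", truncation=True)
    return reward_model(**inputs).logits[0].item()
```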
Knowledge-based benchmarks For evaluation on multiple-choice questions, we use the USMLE (Jin et al., 2021), MedMCQA (Pal et al., 2022), PubMedQA (Jin et al., 2019), and MMLU clinical topics (Hendrycks et al., 2021b) datasets. Further details are provided in Appendix C.4.
MMLU MMLU (Hendrycks et al., 2021b) provides a comprehensive suite of tests for assessing text models in multi-task contexts. We use the clinical topics from the MMLU test set, covering 265 questions on Clinical Knowledge, 100 on Medical Genetics, 135 on Anatomy, 272 on Professional Medicine, 144 on College Biology, and 173 on College Medicine, all formatted as multiple-choice questions.
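The six clinical subjects can be pulled from the Hugging Face hub roughly as follows; the `cais/mmlu` dataset id and subject config names are assumptions based on the public release, not something the paper specifies.

```python
from datasets import concatenate_datasets, load_dataset

CLINICAL_SUBJECTS = [
    "clinical_knowledge",
    "medical_genetics",
    "anatomy",
    "professional_medicine",
    "college_biology",
    "college_medicine",
]

# Concatenate the test splits of the six clinical subjects.
mmlu_clinical = concatenate_datasets(
    [load_dataset("cais/mmlu", subject, split="test") for subject in CLINICAL_SUBJECTS]
)
print(len(mmlu_clinical))  # 265 + 100 + 135 + 272 + 144 + 173 = 1089 questions
```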
MedMCQA MedMCQA (Pal et al., 2022) is a dataset of multiple-choice questions derived from mock exams and past papers of two major Indian medical school entrance examinations, AIIMS and NEET-PG. The dataset is split into a training set of 182,822 questions and a test set of 4,183 questions, with four answer options per question.
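Formatting one MedMCQA item as a four-option prompt might look like the sketch below. The `medmcqa` hub id, its `opa`–`opd`/`cop` field names, the use of the labeled `validation` split as the 4,183-question test set, and the prompt template are all assumptions about the public release rather than details given in the paper.

```python
from datasets import load_dataset

# Assumed public mirror of MedMCQA; its labeled 4,183-question split is exposed
# as "validation" in this release.
medmcqa_test = load_dataset("medmcqa", split="validation")

def format_medmcqa(example: dict) -> tuple[str, str]:
    """Render one item as a 4-option multiple-choice prompt plus gold letter."""
    options = [example["opa"], example["opb"], example["opc"], example["opd"]]
    lines = [example["question"]] + [
        f"{letter}. {text}" for letter, text in zip("ABCD", options)
    ]
    prompt = "\n".join(lines) + "\nAnswer:"
    gold = "ABCD"[example["cop"]]  # `cop` assumed to be the 0-indexed correct option
    return prompt, gold
```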
PubMedQA PubMedQA (Jin et al., 2019) is a biomedical QA benchmark collected from PubMed abstracts. The task is to answer research questions with responses categorized as yes/no/maybe, effectively framing it as a multiple-choice task. The dataset is divided into three subsets: 1,000 manually labeled question-answer pairs (PQA-L), 61,200 unlabeled pairs (PQA-U), and 211,300 artificially generated pairs (PQA-A). Consistent with previous studies (Diao et al., 2023; Singhal et al., 2023), we use the PQA-L subset as the test set for evaluating model performance.
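A loading and prompting sketch for the PQA-L subset is shown below. The `pubmed_qa` hub id, the `pqa_labeled` config, and the prompt wording are assumptions based on the public release; only the choice of PQA-L as the test set comes from the paper.

```python
from datasets import load_dataset

# Assumed hub id/config for the manually labeled PQA-L subset (1,000 pairs).
pqa_l = load_dataset("pubmed_qa", "pqa_labeled", split="train")

def format_pubmedqa(example: dict) -> tuple[str, str]:
    """Frame a PQA-L item as a yes/no/maybe question with its abstract context."""
    context = " ".join(example["context"]["contexts"])
    prompt = (
        f"Context: {context}\n"
        f"Question: {example['question']}\n"
        "Answer (yes, no, or maybe):"
    )
    return prompt, example["final_decision"]
```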
USMLE USMLE (Jin et al., 2021) consists of multiple-choice questions (with 4 choices per question) that are based on the United States Medical Licensing Exams. This dataset has been compiled from questions used in professional medical board examinations and is unique in its multilingual composition, including English, Simplified Chinese, and Traditional Chinese versions. It contains 12,724 questions in English, 34,251 in Simplified Chinese, and 14,123 in Traditional Chinese. For our purposes, we focus on the English component of the dataset, which is further divided into 10,178 questions for the training set, 1,273 for the validation set, and 1,273 for the test set, adhering to the official distribution of the dataset.
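Across all four multiple-choice benchmarks, accuracy can be computed with a simple answer-extraction routine like the sketch below. The regex-based letter extraction is our assumption; the paper does not describe its exact parsing rule.

```python
import re

def extract_choice(generation: str, letters: str = "ABCD") -> str | None:
    """Pull the first standalone option letter out of a model generation."""
    match = re.search(rf"\b([{letters}])\b", generation.strip())
    return match.group(1) if match else None

def accuracy(generations: list[str], golds: list[str]) -> float:
    """Fraction of questions whose extracted choice matches the gold letter."""
    correct = sum(
        extract_choice(g) == gold for g, gold in zip(generations, golds)
    )
    return correct / len(golds)
```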
This paper is available on arxiv under CC BY 4.0 DEED license.