Authors:
(1) Michael Moor, Department of Computer Science, Stanford University, Stanford, USA (these authors contributed equally to this work);
(2) Qian Huang, Department of Computer Science, Stanford University, Stanford, USA (these authors contributed equally to this work);
(3) Shirley Wu, Department of Computer Science, Stanford University, Stanford, USA;
(4) Michihiro Yasunaga, Department of Computer Science, Stanford University, Stanford, USA;
(5) Cyril Zakka, Department of Cardiothoracic Surgery, Stanford Medicine, Stanford, USA;
(6) Yash Dalmia, Department of Computer Science, Stanford University, Stanford, USA;
(7) Eduardo Pontes Reis, Hospital Israelita Albert Einstein, Sao Paulo, Brazil;
(8) Pranav Rajpurkar, Department of Biomedical Informatics, Harvard Medical School, Boston, USA;
(9) Jure Leskovec, Department of Computer Science, Stanford University, Stanford, USA.
6 Discussion, Acknowledgments, and References
The success of large language models (LLMs) Brown et al.; Liang et al. (2022); Qin et al. (2023) has led to significant advances in training specialized models for the medical domain. This has resulted in the emergence of various models, including BioBERT Lee et al. (2020), ClinicalBERT Huang et al. (2019), PubMedBERT Gu et al. (2021), BioLinkBERT Yasunaga et al. (b), DRAGON Yasunaga et al. (a), BioMedLM Bolton et al., BioGPT Luo et al. (2022), and Med-PaLM Singhal et al. Although these medical language models are typically smaller than general-purpose LLMs like GPT-3 Brown et al., they can match or even surpass the performance of such general-purpose models on medical tasks, such as medical question answering.
Recently, there has been growing interest in extending language models to handle vision-language multimodal data and tasks Su et al. (2019); Ramesh et al.; Alayrac et al. (2022); Aghajanyan et al.; Yasunaga et al. (2023). Furthermore, many medical applications involve multimodal information, such as radiology tasks that require the analysis of both X-ray images and radiology reports Tiu et al. (2022). Motivated by these factors, we present a medical vision-language model (VLM). Existing medical VLMs include BiomedCLIP Zhang et al. (2023a) and MedVINT Zhang et al. (2023b). While BiomedCLIP is an encoder-only model, we focus on developing a generative VLM, which demonstrates superior performance compared to MedVINT. Finally, Llava-Med is another recent medical generative VLM Li et al. (2023); however, the model was not yet available for benchmarking.
This paper is available on arXiv under the CC BY-NC-SA 4.0 DEED license.