Authors:
(1) Senthujan Senkaiahliyan M. Mgt, is with the Institute for Health Policy Management and Evaluation, Faculty of Public Health, University of Toronto and Peter Munk Cardiac Centre, University Health Network, Toronto ON, Canada;
(2) Augustin Toma MD, is with the Department of Medical Biophysics, Faculty of Medicine, University of Toronto, Toronto, ON, Canada;
(3) Jun Ma PhD, is with Peter Munk Cardiac Centre, University Health Network; Department of Laboratory Medicine and Pathobiology, University of Toronto; Vector Institute, Toronto, ON Canada;
(4) An-Wen Chan MD, is with the Institute for Health Policy Management and Evaluation, Faculty of Public Health and with the Division of Dermatology, Department of Medicine, University of Toronto, Toronto, ON, Canada;
(5) Andrew Ha MD, is with Peter Munk Cardiac Centre, University Health Network and the Division of Cardiology, Department of Medicine, University of Toronto, Toronto, ON, Canada;
(6) Kevin R. An MD, is with the Division of Cardiac Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;
(7) Hrishikesh Suresh MD, is with the Division of Neurosurgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;
(8) Barry Rubin MD, is with Peter Munk Cardiac Centre, University Health Network and the Division of Vascular Surgery, Department of Surgery, University of Toronto, Toronto, ON, Canada;
(9) Bo Wang PhD (Corresponding Author) is with Peter Munk Cardiac Centre, University Health Network; Department of Laboratory Medicine and Pathobiology and Department of Computer Science, University of Toronto; Vector Institute, Toronto, Canada. E-mail: [email protected].
3. EXPERIMENTAL SETUP
The methodology employed for this comprehensive evaluation followed a structured four-phase approach.
3.1 Dataset Curation
A diverse range of medical images and corresponding labels were selected from public datasets, encompassing diagnostic modalities such as patient clinical photos, radiological images, ECG traces, EEG, fundoscopy, endoscopy, and colonoscopy. GPT-4V analyzed each image in response to the prompts listed in Section 3.4. The prompt, image, and model output for each case were captured together as a screenshot and placed on the evaluation platform for assessment.
3.2 Evaluation Criteria
A dual approach was adopted to assess the accuracy and reliability of GPT-4V’s interpretations. All images were evaluated by two senior surgical residents (K.R.A., H.S.) and a board-certified internal medicine physician (A.T.). ECGs and clinical photos of dermatologic conditions were additionally evaluated by a board-certified cardiac electrophysiologist (A.H.) and dermatologist (A.C.), respectively.
The questionnaires used for the evaluation are listed below.
General Conditions (Diverse Modalities):
• 1) Rate the answer from 1-5.
• 2) Rate from 1-5 how comfortable you would be letting a medical student rely on this content to help learning.
• 3) Was the image interpreted correctly? (Yes/No)
• 4) Was the advice correct? (Yes/No)
• 5) Was the advice given dangerous? (Yes/No)
Cardiology (ECGs):
• 1) Rate the overall interpretation of the ECG (1-5).
• 2) Compared to a standard automated read of an ECG, would you consider this interpretation more competent? (Yes/No)
• 3) Rate from 1-5 how comfortable you would be letting a medical student rely on this content to help learning.
• 4) Would this interpretation be helpful in a medical student’s learning? (Yes/No)
• 5) General Comments:
Dermatology (Clinical Photos):
• 1) Rate the quality of the layman’s description of the rash (1-5)
• 2) Rate the quality of the medical description of the rash (1-5)
• 3) Rate the quality of the differential diagnosis (1-5)
• 4) General Comments
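As an illustration only, the three questionnaires above can be encoded as a small data structure so that each evaluator response is checked against the expected answer type (a 1-5 rating, a Yes/No answer, or a free-text comment). This is a minimal sketch, not the authors' implementation: the names `Question`, `QUESTIONNAIRES`, and `validate` are hypothetical, and question texts are abbreviated.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class Question:
    text: str
    kind: str  # "likert" (1-5 rating), "yes_no", or "comment"


# Hypothetical encoding of the questionnaires in Section 3.2 (texts abbreviated).
QUESTIONNAIRES: Dict[str, List[Question]] = {
    "general": [
        Question("Rate the answer", "likert"),
        Question("Comfort letting a medical student rely on this", "likert"),
        Question("Was the image interpreted correctly?", "yes_no"),
        Question("Was the advice correct?", "yes_no"),
        Question("Was the advice given dangerous?", "yes_no"),
    ],
    "ecg": [
        Question("Rate the overall interpretation of the ECG", "likert"),
        Question("More competent than a standard automated read?", "yes_no"),
        Question("Comfort letting a medical student rely on this", "likert"),
        Question("Helpful in a medical student's learning?", "yes_no"),
        Question("General comments", "comment"),
    ],
    "dermatology": [
        Question("Quality of the layman's description", "likert"),
        Question("Quality of the medical description", "likert"),
        Question("Quality of the differential diagnosis", "likert"),
        Question("General comments", "comment"),
    ],
}

Answer = Union[int, bool, str]


def validate(category: str, answers: List[Answer]) -> None:
    """Raise ValueError if the answers do not match the category's questionnaire."""
    questions = QUESTIONNAIRES[category]
    if len(answers) != len(questions):
        raise ValueError(f"expected {len(questions)} answers, got {len(answers)}")
    for number, (question, answer) in enumerate(zip(questions, answers), start=1):
        if question.kind == "likert":
            # bool is a subclass of int in Python, so it is excluded explicitly.
            is_int = isinstance(answer, int) and not isinstance(answer, bool)
            if not is_int or not 1 <= answer <= 5:
                raise ValueError(f"question {number}: rating must be an integer 1-5")
        elif question.kind == "yes_no":
            if not isinstance(answer, bool):
                raise ValueError(f"question {number}: answer must be True or False")
        elif not isinstance(answer, str):
            raise ValueError(f"question {number}: comment must be text")


# Example: one evaluator's response to a general-conditions case.
validate("general", [4, 3, True, True, False])
```

Keeping the answer type alongside each question means a malformed response (for example, a rating of 6) is rejected at entry rather than discovered during analysis.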
3.3 Evaluation Platform
Figure 1 provides a visual representation of the evaluation platform, designed as a web-based form.
• Left Side: Displayed the image, the prompt, and GPT-4V’s output as a screenshot, together with the correct label for the image.
• Right Side: Comprised a section for evaluator feedback, including questions from the evaluation criteria.
3.4 Prompts
The prompts given to GPT-4V to generate outputs are listed below for each category.
General Conditions (Diverse Modalities):
• “You are an expert [insert specialty i.e. radiology, cardiology] tutor. Explain to a medical student what this patient photo indicates.”
• “If indicators like arrows, asterisks, or circles are present in the image, underline and expand on their significance.”
Cardiology (ECGs):
• “You are an expert cardiology tutor assisting a medical student. Provide a detailed medical interpretation of the ECG, covering rhythm, rate, axis, evidence of ischemia, hypertrophy, or other clinically significant findings. Finally, list a differential diagnosis based on the ECG findings.”
Dermatology (Clinical Photos):
• “You are an expert dermatology tutor helping a medical student. Describe the rash seen in the photo in layman’s terms. Next, describe it using medical terminology. Finally, list a differential diagnosis for the given image.”
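For illustration, the prompts above can be stored as templates in which only the specialty placeholder of the general-conditions prompt varies per image. The sketch below is an assumption about how this might be organized, not the authors' tooling: `PROMPTS` and `build_prompts` are hypothetical names, the source does not state whether the two general-conditions prompts were sent together or separately, and no model API call is shown.

```python
from typing import Dict, List, Optional

# Prompt texts taken from Section 3.4; "{specialty}" stands in for the
# bracketed "[insert specialty ...]" placeholder in the original prompt.
PROMPTS: Dict[str, List[str]] = {
    "general": [
        "You are an expert {specialty} tutor. Explain to a medical student "
        "what this patient photo indicates.",
        "If indicators like arrows, asterisks, or circles are present in the "
        "image, underline and expand on their significance.",
    ],
    "ecg": [
        "You are an expert cardiology tutor assisting a medical student. "
        "Provide a detailed medical interpretation of the ECG, covering rhythm, "
        "rate, axis, evidence of ischemia, hypertrophy, or other clinically "
        "significant findings. Finally, list a differential diagnosis based on "
        "the ECG findings.",
    ],
    "dermatology": [
        "You are an expert dermatology tutor helping a medical student. "
        "Describe the rash seen in the photo in layman's terms. Next, describe "
        "it using medical terminology. Finally, list a differential diagnosis "
        "for the given image.",
    ],
}


def build_prompts(category: str, specialty: Optional[str] = None) -> List[str]:
    """Return the prompt texts for a category, filling in the specialty if needed."""
    if category not in PROMPTS:
        raise ValueError(f"unknown category: {category!r}")
    if category == "general" and not specialty:
        raise ValueError("the general-conditions prompt requires a specialty")
    # str.replace is used instead of str.format so that literal braces in a
    # prompt could never be misread as format fields.
    return [text.replace("{specialty}", specialty or "") for text in PROMPTS[category]]


# Example: prompts for a radiology image in the general-conditions category.
for prompt in build_prompts("general", specialty="radiology"):
    print(prompt)
```

Fixing the prompt text per category keeps the wording identical across all images in that category, so differences in output reflect the image rather than the phrasing.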
This paper is available on arXiv under a CC BY-NC-ND 4.0 DEED license.