3.2 Datasets
We benchmark model performance on 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We focus on image classification because other tasks, such as region captioning, would require substantially more tokens and thereby limit the total number of demonstration examples, and because most LMMs cannot yet accurately produce the localizations, such as bounding boxes and segmentation masks, required by other tasks [17, 18]. Table 1 summarizes the datasets used in this study.
For each dataset, we construct a set of demonstration (demo) examples for in-context learning from the original training and validation splits, and a test set from the original test split (if one exists) to evaluate model performance. We randomly sample the demo and test sets from the original dataset without replacement. For the multi-class and fine-grained classification datasets, we perform class-stratified sampling, ensuring an equal number of examples per class in both the demo and test sets. For the multi-label classification dataset (CheXpert), we sample an equal number of positive and negative examples per class in both sets; because the task is multi-label, this procedure does not yield an exactly equal number of examples per class. The per-dataset sizes of the full demo and test sets are shown in Table 1. For the scaling experiments, we increase the number of demo examples up to these sizes while maintaining class balance.
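To make the sampling procedure concrete, the sketch below illustrates class-stratified sampling without replacement for the multi-class case. The function name, data format (a list of `(image, label)` pairs), and per-class counts are illustrative assumptions on our part, not details specified in the paper.

```python
import random
from collections import defaultdict

def stratified_split(examples, n_demo_per_class, n_test_per_class, seed=0):
    """Sample class-balanced demo and test sets without replacement.

    `examples` is assumed to be a list of (image, label) pairs; the
    exact data format used in the paper is not specified.
    """
    rng = random.Random(seed)

    # Group examples by class label.
    by_class = defaultdict(list)
    for ex in examples:
        by_class[ex[1]].append(ex)

    demo, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        needed = n_demo_per_class + n_test_per_class
        assert len(items) >= needed, f"class {label!r} has too few examples"
        # Disjoint slices guarantee sampling without replacement.
        demo.extend(items[:n_demo_per_class])
        test.extend(items[n_demo_per_class:needed])
    return demo, test
```

For the scaling experiments, a class-balanced demo subset of any smaller size can be drawn the same way by reducing `n_demo_per_class`.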
Authors:
(1) Yixing Jiang, Stanford University;
(2) Jeremy Irvin, Stanford University;
(3) Ji Hun Wang, Stanford University;
(4) Muhammad Ahmed Chaudhry, Stanford University;
(5) Jonathan H. Chen, Stanford University;
(6) Andrew Y. Ng, Stanford University.