
Many-Shot In-Context Learning in Multimodal Foundation Models: Datasets


Table of Links

Abstract and 1 Introduction

2 Related Work

3 Methods and 3.1 Models

3.2 Datasets

3.3 Evaluation Metrics

4 Results and 4.1 Increasing number of demonstrating examples

4.2 Impact of batching queries

4.3 Cost and latency analysis

5 Discussion

6 Conclusion and References

A. Prompts used for ICL experiments

B. Prompt selection

C. GPT4(V)-Turbo performance under many-shot ICL

D. Performance of many-shot ICL on medical QA tasks

Acknowledgments and Disclosure of Funding

3.2 Datasets

We benchmark model performance on 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, and molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification). We focus on image classification because other tasks, such as region captioning, would require substantially more tokens, thereby limiting the total number of demonstration examples, and because most LMMs are not yet capable of accurately producing the localizations (e.g., bounding boxes and segmentation masks) that other tasks require [17, 18]. Table 1 provides a summary of the datasets used in this study.


For all datasets, we construct a set of demonstration (demo) examples for in-context learning from the original training and validation splits, and a test set from the original test split (if one exists) to evaluate model performance. We randomly sample the demo and test sets from the original dataset without replacement. For the multi-class and fine-grained classification datasets, we perform class-stratified sampling, ensuring an equal number of examples per class in both the demo and test sets. For the multi-label classification dataset (CheXpert), we sample an equal number of positive and negative examples per class in both the demo and test sets; we note that, because the task is multi-label, this procedure does not yield an exactly equal number of examples per class. The per-dataset sizes of the full demo and test sets are shown in Table 1, and for the scaling experiments we increase the number of demonstration examples up to the sizes shown in the table while maintaining class balance.
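To make the sampling procedure concrete, the sketch below shows one way to build class-balanced demo and test sets. It is not the authors' released code: the DataFrame layout, function names, and use of NumPy's default_rng are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def stratified_demo_test_split(df, label_col, n_demo_per_class, n_test_per_class, seed=0):
    """Class-stratified sampling without replacement for multi-class /
    fine-grained datasets: equal examples per class, disjoint demo and test sets."""
    rng = np.random.default_rng(seed)
    demo_rows, test_rows = [], []
    for _, group in df.groupby(label_col):
        order = rng.permutation(len(group))
        demo_rows.append(group.iloc[order[:n_demo_per_class]])
        test_rows.append(group.iloc[order[n_demo_per_class:n_demo_per_class + n_test_per_class]])
    return pd.concat(demo_rows), pd.concat(test_rows)


def multilabel_balanced_sample(df, label_cols, n_per_class, seed=0):
    """For a multi-label dataset (e.g. CheXpert), draw an equal number of
    positive and negative examples per label. Because each image carries
    several labels at once, per-class counts are only approximately equal."""
    rng = np.random.default_rng(seed)
    selected = set()
    for col in label_cols:
        pos = df.index[df[col] == 1].to_numpy()
        neg = df.index[df[col] == 0].to_numpy()
        selected.update(rng.choice(pos, size=min(n_per_class, len(pos)), replace=False))
        selected.update(rng.choice(neg, size=min(n_per_class, len(neg)), replace=False))
    return df.loc[sorted(selected)]
```

For the scaling experiments, smaller demo sets can then be taken as class-balanced prefixes of the full demo set, so each increase in the number of demonstration examples preserves the balance described above.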


Authors:

(1) Yixing Jiang, Stanford University ([email protected]);

(2) Jeremy Irvin, Stanford University ([email protected]);

(3) Ji Hun Wang, Stanford University ([email protected]);

(4) Muhammad Ahmed Chaudhry, Stanford University ([email protected]);

(5) Jonathan H. Chen, Stanford University ([email protected]);

(6) Andrew Y. Ng, Stanford University ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.

