
Many-Shot In-Context Learning in Multimodal Foundation Models: Evaluation Metrics


Table of Links

Abstract and 1 Introduction

2 Related Work

3 Methods and 3.1 Models

3.2 Datasets

3.3 Evaluation Metrics

4 Results and 4.1 Increasing number of demonstrating examples

4.2 Impact of batching queries

4.3 Cost and latency analysis

5 Discussion

6 Conclusion and References

A. Prompts used for ICL experiments

B. Prompt selection

C. GPT4(V)-Turbo performance under many-shot ICL

D. Performance of many-shot ICL on medical QA tasks

Acknowledgments and Disclosure of Funding

3.3 Evaluation Metrics

We use standard metrics to evaluate model performance on each dataset. Specifically, we measure performance using accuracy for all multi-class classification datasets, since each is sampled to have a balanced class distribution. For multi-label classification on CheXpert, we use the macro-averaged F1 metric. In the rare case of a parsing error, we count the response as incorrect. To estimate the variability of each evaluation metric, we compute its standard deviation via bootstrapping with 1,000 bootstrap replicates.
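A minimal sketch of the evaluation procedure described above, assuming NumPy; the function names are illustrative, not from the paper. Unparseable responses can be represented as `None`, which never matches a true label and therefore counts as incorrect, and the bootstrap resamples the test set with replacement 1,000 times to estimate the metric's standard deviation.

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Fraction of exact matches; a parsing failure (None) never
    # equals a true label, so it is automatically scored incorrect.
    return float(np.mean([t == p for t, p in zip(y_true, y_pred)]))

def bootstrap_std(y_true, y_pred, metric=accuracy, n_boot=1000, seed=0):
    # Standard deviation of `metric` over n_boot bootstrap replicates
    # of the test set (sampled with replacement).
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=object)
    y_pred = np.asarray(y_pred, dtype=object)
    n = len(y_true)
    scores = [
        metric(y_true[idx], y_pred[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    return float(np.std(scores))
```

For the multi-label CheXpert evaluation, `accuracy` would be swapped for a macro-averaged F1 (e.g. scikit-learn's `f1_score` with `average="macro"`), while the bootstrap wrapper stays the same.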



Table 1: Summary of benchmark datasets. We use 10 datasets spanning multiple domains (natural imagery, medical imagery, remote sensing, molecular imagery) and tasks (multi-class, multi-label, and fine-grained classification).


Table 2: Many-shot ICL performance and efficiency comparison. We report the performance under a zero-shot regime and performance at the optimal demo set size as well as the many-shot ICL data efficiency of GPT-4o and Gemini 1.5 Pro. We measure performance using accuracy on all datasets except CheXpert, for which we use macro-average F1. We bold the highest ICL data efficiency between the two models on each dataset.


Authors:

(1) Yixing Jiang, Stanford University ([email protected]);

(2) Jeremy Irvin, Stanford University ([email protected]);

(3) Ji Hun Wang, Stanford University ([email protected]);

(4) Muhammad Ahmed Chaudhry, Stanford University ([email protected]);

(5) Jonathan H. Chen, Stanford University ([email protected]);

(6) Andrew Y. Ng, Stanford University ([email protected]).


This paper is available on arxiv under CC BY 4.0 DEED license.


