
What matters when building vision-language models?: Details of the evaluations


Table of Links

Abstract and 1 Introduction

2 Terminology

3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?

3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?

3.3 Where are the efficiency gains?

3.4 How can one trade compute for performance?

4 Idefics2 - an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training

4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios

5 Conclusion, Acknowledgement, and References


A Appendix

A.1 Further experimental details of the ablations

A.2 Details of the instruction fine-tuning

A.3 Details of the evaluations

A.4 Red-teaming

A.3 Details of the evaluations

A.3.1 Evaluation setup

We perform all evaluations with a batch size of 1 and greedy decoding.


For the multiple-choice questions in MMMU, MathVista, and MMBench, we evaluate with the same prompt used for similar types of datasets during the instruction fine-tuning:



For the open-ended questions in TextVQA, DocVQA, and VQAv2, we evaluate with the prompt:



We use the stop words Question, User, <end_of_utterance>, and <eos> to stop a generation.
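A minimal sketch of this evaluation setup is shown below, assuming a Hugging Face transformers-style model and processor. The helper names (StopOnWords, greedy_answer) and the token budget are our own illustration under those assumptions, not the paper's actual evaluation harness.

```python
# Sketch of per-sample evaluation: batch size 1, greedy decoding,
# and early stopping on the stop words listed above.
# Assumes a transformers-style `model`, `processor`, and `processor.tokenizer`.
from transformers import StoppingCriteria, StoppingCriteriaList

STOP_WORDS = ["Question", "User", "<end_of_utterance>", "<eos>"]

class StopOnWords(StoppingCriteria):
    """Stop generation as soon as any stop word appears in the newly generated text."""
    def __init__(self, tokenizer, stop_words, prompt_len):
        self.tokenizer = tokenizer
        self.stop_words = stop_words
        self.prompt_len = prompt_len  # number of prompt tokens to skip when decoding

    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(
            input_ids[0, self.prompt_len:], skip_special_tokens=False
        )
        return any(word in new_text for word in self.stop_words)

def greedy_answer(model, processor, image, prompt):
    # Batch size of 1 and greedy decoding (no sampling), as in the evaluation setup.
    inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
    prompt_len = inputs["input_ids"].shape[1]
    stopping = StoppingCriteriaList(
        [StopOnWords(processor.tokenizer, STOP_WORDS, prompt_len)]
    )
    output_ids = model.generate(
        **inputs,
        do_sample=False,           # greedy decoding
        max_new_tokens=128,        # assumed budget, not specified in the paper
        stopping_criteria=stopping,
    )
    answer = processor.tokenizer.decode(
        output_ids[0, prompt_len:], skip_special_tokens=True
    )
    # Trim any trailing stop word that slipped into the decoded text.
    for word in STOP_WORDS:
        answer = answer.split(word)[0]
    return answer.strip()
```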

A.3.2 Expanded evaluation table

We report the expanded evaluation of Idefics2 and the comparison to other models in Table 15. This includes scores on VQAv2 (Goyal et al., 2017), which is widely adopted for evaluation. We acknowledge, though, that the metric used for the open-ended visual question answering benchmarks strongly penalizes models that do not generate answers in the same format as the ground truth. For example, answering "large" when the ground truth is "big", or giving a more verbose reformulation, is counted as incorrect. Our manual qualitative analysis reveals that on benchmarks like VQAv2, the difference between the generations of two models whose scores differ by 5 points is barely noticeable. This problem is less of a concern for other open-ended benchmarks like TextVQA or DocVQA, which require finding a piece of text in an image, making the expected answer less prone to ambiguity.
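To make the format sensitivity concrete, below is a simplified sketch of the standard VQA-style accuracy rule; this is our own illustration, not code from the paper, and the official metric additionally averages over annotator subsets and applies more elaborate answer normalization.

```python
# Simplified VQA-style accuracy: min(#matching human answers / 3, 1)
# after lowercasing and stripping whitespace.
def vqa_accuracy(prediction: str, human_answers: list[str]) -> float:
    pred = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == pred)
    return min(matches / 3.0, 1.0)

# Ten annotators all answered "big": the exact match scores 1.0,
# while the synonymous "large" scores 0.0, illustrating the format penalty.
annotators = ["big"] * 10
print(vqa_accuracy("big", annotators))    # 1.0
print(vqa_accuracy("large", annotators))  # 0.0
```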

A.3.3 Qualitative evaluation

We show examples of generations with Idefics2-chatty in Figures 5, 6, and 7.


Authors:

(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);

(2) Léo Tronchon, Hugging Face (the order was chosen randomly);

(3) Matthieu Cord, Sorbonne Université;

(4) Victor Sanh, Hugging Face.


This paper is available on arXiv under the CC BY 4.0 DEED license.

