Table of Links
3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?
3.3 Where are the efficiency gains?
3.4 How can one trade compute for performance?
4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios
5 Conclusion, Acknowledgement, and References
A Appendix
A.1 Further experimental details of the ablations
A.2 Details of the instruction fine-tuning
A.3 Details of the evaluations
A.3.1 Evaluation setup
We perform all evaluations with a batch size of 1 and greedy decoding.
For the multiple-choice questions in MMMU, MathVista, and MMBench, we evaluate with the same prompt used for similar types of datasets during the instruction fine-tuning:
For the open-ended questions in TextVQA, DocVQA, and VQAv2, we evaluate with the prompt:
We use the stop words Question, User, <end_of_utterance>, and <eos> to stop a generation.
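The stop-word mechanism above can be sketched as a simple post-processing step: cut the generated text at the earliest occurrence of any stop word. This is a minimal illustrative sketch, not the authors' actual evaluation code; the function name and logic are assumptions.

```python
# Illustrative sketch: truncate a generation at the first stop word,
# mirroring the stop words listed above. Not the authors' actual code.
STOP_WORDS = ["Question", "User", "<end_of_utterance>", "<eos>"]

def truncate_at_stop_words(generation: str, stop_words=STOP_WORDS) -> str:
    """Cut the generation at the earliest occurrence of any stop word."""
    cut = len(generation)
    for stop in stop_words:
        idx = generation.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return generation[:cut].rstrip()
```

In practice, evaluation harnesses usually pass such strings as stopping criteria to the decoding loop itself, so generation halts as soon as a stop sequence appears rather than being trimmed afterwards.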
A.3.2 Expanded evaluation table
We report the expanded evaluation of Idefics2 and the comparison to other models in Table 15. This includes scores on VQAv2 (Goyal et al., 2017), which is widely adopted for evaluation. We acknowledge, though, that the metric used for the open-ended visual question answering benchmarks strongly penalizes models that do not generate answers in the same format as the ground truth. For example, answering "large" when the ground truth is "big", or producing a more verbose reformulation, is counted as incorrect. Our manual qualitative analysis reveals that on benchmarks like VQAv2, the difference between the generations of two models whose scores differ by 5 points is barely noticeable. This problem is less concerning for other open-ended benchmarks like TextVQA or DocVQA, which require finding a text in an image, making the expected answer less prone to ambiguity.
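The format sensitivity described above follows from how the VQAv2 accuracy is computed: an answer is scored by exact string match against the ten human annotator answers, with credit min(#matching annotators / 3, 1). A simplified sketch (omitting the official metric's answer normalization and averaging over annotator subsets) shows why "large" gets zero credit when every annotator wrote "big":

```python
# Simplified sketch of the VQAv2 accuracy metric: a prediction scores
# min(#annotators giving that exact answer / 3, 1). The official metric
# also normalizes answers and averages over annotator subsets.
def vqa_accuracy(prediction: str, annotator_answers: list[str]) -> float:
    pred = prediction.strip().lower()
    matches = sum(1 for a in annotator_answers if a.strip().lower() == pred)
    return min(matches / 3.0, 1.0)
```

Under this scoring, a semantically correct but differently worded answer ("large" vs. "big") receives no credit, which is exactly the penalty discussed above.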
A.3.3 Qualitative evaluation
We show in Figures 5, 6, and 7 examples of generations with Idefics2-chatty.
Authors:
(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order of the first two authors was chosen randomly);
(2) Léo Tronchon, Hugging Face;
(3) Matthieu Cord, Sorbonne Université;
(4) Victor Sanh, Hugging Face.