
What matters when building vision-language models? How can one trade compute for performance?


Table of Links

Abstract and 1 Introduction

2 Terminology

3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?

3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?

3.3 Where are the efficiency gains?

3.4 How can one trade compute for performance?

4 Idefics2 - an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training

4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios

5 Conclusion, Acknowledgement, and References


A Appendix

A.1 Further experimental details of the ablations

A.2 Details of the instruction fine-tuning

A.3 Details of the evaluations

A.4 Red-teaming

3.4 How can one trade compute for performance?

Prior work (Lin et al., 2023; Li et al., 2024; Liu et al., 2024; McKinzie et al., 2024) shows that splitting an image into sub-images boosts downstream performance with no changes to the model’s signature. An image is decomposed into sub-images (for instance, 4 equal sub-images), each resized to the original image’s size, and these are concatenated with the original image to form a sequence of 5 images. This strategy, however, comes at the cost of a much higher number of tokens to encode the images.
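To make the decomposition concrete, here is a minimal sketch of such a splitting scheme in Python using Pillow. The function name split_image and the 2×2 grid are illustrative assumptions, not taken from the paper’s code.

```python
# Minimal sketch of the image-splitting strategy (assumed 2x2 grid);
# `split_image` is an illustrative name, not the paper's implementation.
from PIL import Image


def split_image(image: Image.Image) -> list[Image.Image]:
    """Return 4 equal crops resized to the original size, plus the original."""
    width, height = image.size
    half_w, half_h = width // 2, height // 2
    boxes = [
        (0, 0, half_w, half_h),           # top-left
        (half_w, 0, width, half_h),       # top-right
        (0, half_h, half_w, height),      # bottom-left
        (half_w, half_h, width, height),  # bottom-right
    ]
    crops = [image.crop(box).resize((width, height)) for box in boxes]
    return crops + [image]  # sequence of 5 images, original last
```

Each crop covers a quarter of the image but is fed to the vision encoder at full resolution, which is where the extra detail, and the extra token count, comes from.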


We adopt this strategy during the instruction fine-tuning stage. Each single image becomes a list of 5 images: the 4 crops and the original image. At inference, the model can therefore handle both standalone images (64 visual tokens per image) and artificially augmented images (320 visual tokens in total per image). We find this strategy particularly useful for benchmarks like TextVQA and DocVQA, which require a sufficiently high resolution to extract the text in an image (see Table 9).
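As a rough sanity check on the token budget, the sketch below reuses split_image from above and assumes the 64 visual tokens per image stated in the text; the dummy image is a placeholder.

```python
# Usage sketch building on split_image above; the dummy image stands in for
# a real sample, and 64 visual tokens per image is taken from the text.
from PIL import Image

TOKENS_PER_IMAGE = 64

image = Image.new("RGB", (640, 480))        # placeholder for a real image

standalone = [image]                        # 1 image
augmented = split_image(image)              # 4 crops + original = 5 images

print(len(standalone) * TOKENS_PER_IMAGE)   # 64 visual tokens
print(len(augmented) * TOKENS_PER_IMAGE)    # 320 visual tokens
```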


Moreover, when we apply image splitting to only 50% of the training samples (instead of 100%), we observe that it does not impair the performance increase that image splitting provides. Surprisingly, we find at evaluation time that increasing the resolution of the sub-images (and the standalone image) provides only a minor boost compared to the improvement yielded by image splitting alone: 73.6% accuracy when increasing the sub-image resolution to the maximum vs 73.0% on our TextVQA validation set, and 72.7 vs 72.9 ANLS, respectively, on the DocVQA validation set.
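One plausible way to implement the 50% variant is a per-sample coin flip; the exact sampling scheme is not specified in the text, so the sketch below (again reusing split_image from above) is an assumption.

```python
# Hedged sketch of applying image splitting to ~50% of training samples;
# the per-sample Bernoulli draw is an assumption, not the paper's recipe.
import random

from PIL import Image


def maybe_split(image: Image.Image, p_split: float = 0.5) -> list[Image.Image]:
    """Split a training image with probability `p_split`, else keep it whole."""
    if random.random() < p_split:
        return split_image(image)  # 4 crops + original -> 320 visual tokens
    return [image]                 # standalone image   -> 64 visual tokens
```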



Authors:

(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);

(2) Léo Tronchon, Hugging Face (the order was chosen randomly);

(3) Matthieu Cord, Sorbonne Université;

(4) Victor Sanh, Hugging Face.


This paper is available on arXiv under a CC BY 4.0 DEED license.

