
Exploring the design space of vision-language models and Are all pre-trained backbones equivalent for VLMs?


Table of Links

Abstract and 1 Introduction

2 Terminology

3 Exploring the design space of vision-language models and 3.1 Are all pre-trained backbones equivalent for VLMs?

3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?

3.3 Where are the efficiency gains?

3.4 How can one trade compute for performance?

4 Idefics2 - an open state-of-the-art vision-language foundation model and 4.1 Multi-stage pre-training

4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios

5 Conclusion, Acknowledgement, and References


A Appendix

A.1 Further experimental details of the ablations

A.2 Details of the instruction fine-tuning

A.3 Details of the evaluations

A.4 Red-teaming

3 Exploring the design space of vision-language models

In this section, we compare recurring design choices from the vision-language model literature and highlight our findings. Unless specified otherwise, we run each ablation for 6,000 steps and report the average 4-shot score across 4 downstream benchmarks measuring different capabilities: VQAv2 (Goyal et al., 2017) for general visual question answering, TextVQA (Singh et al., 2019) for OCR abilities, OKVQA (Marino et al., 2019) for external knowledge, and COCO (Lin et al., 2014) for captioning.
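To make the aggregation concrete, here is a minimal Python sketch of how the reported number is obtained: a 4-shot score per benchmark, then a plain average. The score values below are placeholders for illustration, not results from the paper.

```python
# Minimal sketch of the score aggregation used in the ablations:
# each variant is evaluated 4-shot on four benchmarks, and the single
# reported number is the plain average of those scores.
# The values below are placeholders, not results from the paper.

from statistics import mean

four_shot_scores = {
    "VQAv2": 0.0,    # general visual question answering
    "TextVQA": 0.0,  # OCR abilities
    "OKVQA": 0.0,    # external knowledge
    "COCO": 0.0,     # captioning
}

average_score = mean(four_shot_scores.values())
print(f"Average 4-shot score: {average_score:.1f}")
```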

3.1 Are all pre-trained backbones equivalent for VLMs?

Most recent VLMs start from pre-trained unimodal backbones. How does the choice of the backbones (vision and text) influence the performance of the resulting VLM?


Table 1: Ablation on the language model backbone.


We fix the size of the pre-trained backbones, the data used for multimodal pre-training, and the number of training updates. Under the cross-attention architecture, we observe that the greatest improvement in performance on vision-language benchmarks comes from changing the language model to a better one. More specifically, replacing LLaMA-1-7B (Touvron et al., 2023) (35.1% on MMLU (Hendrycks et al., 2021)) with Mistral-7B (Jiang et al., 2023) (60.1% on MMLU) yields a boost of 5.1 points (see Table 1). Additionally, switching the vision encoder from CLIP-ViT-H (Radford et al., 2021) (78.0% on ImageNet (Deng et al., 2009)) to SigLIP-SO400M (Zhai et al., 2023) (83.2% on ImageNet) yields a 3.3-point increase on the benchmarks (see Table 2). This result on better vision backbones corroborates the observations of Karamcheti et al. (2024).
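For readers unfamiliar with the cross-attention architecture referenced here, the PyTorch sketch below shows the kind of block such a design interleaves between language-model layers: text hidden states attend to frozen vision-encoder features through a gated cross-attention. The dimensions, gating scheme, and class name are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch (assumed shapes and naming): a gated cross-attention
# block in which language-model hidden states attend to vision features.

import torch
import torch.nn as nn

class VisionTextCrossAttention(nn.Module):
    def __init__(self, text_dim: int = 4096, vision_dim: int = 1152, n_heads: int = 32):
        super().__init__()
        self.vision_proj = nn.Linear(vision_dim, text_dim)   # map image features to LM width
        self.attn = nn.MultiheadAttention(text_dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))              # tanh gate; block starts as identity

    def forward(self, text_hidden: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        # text_hidden: (batch, seq_len, text_dim) from the language model
        # image_feats: (batch, n_patches, vision_dim) from the vision encoder
        kv = self.vision_proj(image_feats)
        attended, _ = self.attn(query=text_hidden, key=kv, value=kv)
        return text_hidden + torch.tanh(self.gate) * attended

# Usage with random tensors standing in for real features.
block = VisionTextCrossAttention()
text = torch.randn(1, 16, 4096)     # 16 text tokens
image = torch.randn(1, 576, 1152)   # 576 image patches
print(block(text, image).shape)     # torch.Size([1, 16, 4096])
```

Because the gate is initialized at zero, the block initially passes the language model's hidden states through unchanged, which is one common way such architectures preserve the pre-trained backbone's behavior at the start of multimodal training.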


We note that Chen and Wang (2022) report a stronger performance increase from scaling the size of the vision encoder than from scaling the size of the language model, even though scaling the vision encoder adds fewer parameters. Although EVA-CLIP-5B (Sun et al., 2023) is ten times larger in parameter count than SigLIP-SO400M (Zhai et al., 2023), we obtain similar performance across the 4 benchmarks, suggesting that EVA-CLIP-5B could be heavily under-trained, and we acknowledge that the open VLM community is missing a large, well-trained vision encoder.
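The parameter-count comparison above can be reproduced with a one-line helper; the sketch below counts parameters for any torch module. The toy module is only a stand-in, and no specific loading API for EVA-CLIP-5B or SigLIP-SO400M is implied.

```python
# Hedged sketch: comparing vision-encoder sizes by raw parameter count.
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total parameter count of a torch module."""
    return sum(p.numel() for p in model.parameters())

# Sanity check on a toy module; with real checkpoints one would pass the
# loaded vision towers instead (roughly 5e9 parameters for EVA-CLIP-5B
# versus roughly 4e8 for SigLIP-SO400M, hence the ~10x gap cited above).
toy = nn.Linear(1152, 4096)
print(count_parameters(toy))  # 4096 * 1152 + 4096 = 4722688
```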


Table 2: Ablation on the vision encoder backbone.


Authors:

(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);

(2) Léo Tronchon, Hugging Face (the order was chosen randomly);

(3) Matthieu Cord, Sorbonne Université;

(4) Victor Sanh, Hugging Face.


This paper is available on arXiv under the CC BY 4.0 DEED license.

