Table of Links
3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?
3.3 Where are the efficiency gains?
3.4 How can one trade compute for performance?
4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios
5 Conclusion, Acknowledgement, and References
A Appendix
A.1 Further experimental details of the ablations
A.2 Details of the instruction fine-tuning
A.3 Details of the evaluations
A.1 Further experimental details of the ablations
A.1.1 Cross-attention vs. fully autoregressive architectures
We apply LoRA modules to the LLM for the fully autoregressive architecture, and to the cross-attention modules and the LLM for the cross-attention architecture. In Figure 4, we report the average performance with respect to the number of steps, the number of images, and the number of text tokens. We see an improvement across the board with the fully autoregressive architecture. Comparing the average score along these different axes is essential because the cross-attention architecture feeds a single token per image to the language model, against 64 for the fully autoregressive architecture with perceiver pooling. This implies that, for the same training sequence length, the number of images and text tokens differs between the two architectures; equivalently, the same multimodal document yields different sequence lengths. Even though we fix the batch size in the comparison, the number of text tokens and the number of images therefore grow at different paces under the two architectures.
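For illustration, the sketch below shows how LoRA adapters of this kind could be attached with the Hugging Face peft library. The base checkpoint and the target module names are assumptions made for the example, not the exact modules used in our codebase.

```python
# Minimal sketch: attaching LoRA adapters with Hugging Face `peft`.
# The checkpoint id and module names below are illustrative placeholders,
# not the exact modules or backbone used in the paper's ablations.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# Fully autoregressive architecture: LoRA on the LLM attention projections only.
fully_ar_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
llm_with_lora = get_peft_model(llm, fully_ar_cfg)
llm_with_lora.print_trainable_parameters()

# Cross-attention architecture: LoRA on both the LLM self-attention and the
# newly inserted cross-attention blocks, matched here by a regex pattern.
# This config would be applied to the cross-attention variant of the model
# (not constructed in this sketch).
cross_attn_cfg = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=r".*(self_attn|cross_attn)\.(q|k|v|o)_proj",
)
```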
A.1.2 Comparing various vision backbones
We present in Table 10 the detailed results of comparing multiple vision backbones. While EVA-CLIP-5B performs similarly to SigLIP-SO400M, we emphasize that it has 11 times more parameters. We also noticed in early experiments that TextVQA is the benchmark most sensitive to image resolution, which accounts for the performance increase.
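For reference, the snippet below sketches how patch-level hidden states can be extracted from the SigLIP-SO400M backbone with transformers; the public checkpoint name and the blank placeholder image are used only for illustration.

```python
# Sketch: extracting patch-level hidden states from a SigLIP vision backbone
# with `transformers`. The checkpoint is the public SigLIP-SO400M release and
# the blank image is a placeholder, used here only for illustration.
import torch
from PIL import Image
from transformers import SiglipImageProcessor, SiglipVisionModel

ckpt = "google/siglip-so400m-patch14-384"
processor = SiglipImageProcessor.from_pretrained(ckpt)
vision_encoder = SiglipVisionModel.from_pretrained(ckpt)

image = Image.new("RGB", (384, 384))  # placeholder image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

with torch.no_grad():
    hidden_states = vision_encoder(pixel_values).last_hidden_state
print(hidden_states.shape)  # (1, num_patches, vision_hidden_dim)
```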
A.1.3 Comparing various pooling strategies
We compare multiple pooling strategies: a simple linear layer that takes the flattened sequence of vision hidden states and projects it into a shorter sequence of visual tokens, as well as a Mapping Network (Mañas et al., 2023). The perceiver resampler significantly outperforms these two options (see Table 11).
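To make the comparison concrete, the sketch below implements a simple linear pooling layer and a simplified perceiver-style resampler with learned latent queries. The hidden sizes, number of heads, and exact attention wiring are illustrative assumptions, not the paper's implementation.

```python
# Sketch of two pooling strategies over vision hidden states.
# Dimensions and the simplified attention wiring are illustrative, not the
# paper's exact hyper-parameters (the real resampler also uses layer norms).
import torch
import torch.nn as nn

class LinearPooling(nn.Module):
    """Project the sequence of patch embeddings to a shorter sequence of visual tokens."""
    def __init__(self, num_patches: int, num_visual_tokens: int):
        super().__init__()
        self.proj = nn.Linear(num_patches, num_visual_tokens)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, num_patches, dim)
        return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (B, num_visual_tokens, dim)

class PerceiverResamplerLayer(nn.Module):
    """Learned latent queries cross-attend to the patch embeddings, then pass through an MLP."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, latents: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        latents = latents + self.attn(latents, x, x)[0]
        return latents + self.mlp(latents)

class PerceiverResampler(nn.Module):
    """Stack of resampler layers pooling an image into a fixed number of visual tokens."""
    def __init__(self, dim: int, num_visual_tokens: int = 64, num_layers: int = 3):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_visual_tokens, dim))
        self.layers = nn.ModuleList([PerceiverResamplerLayer(dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, num_patches, dim)
        latents = self.latents.unsqueeze(0).expand(x.size(0), -1, -1)
        for layer in self.layers:
            latents = layer(latents, x)
        return latents  # (B, num_visual_tokens, dim)
```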
We also ablate the number of layers in the perceiver resampler, and find no statistically significant differences when increasing the number of layers, similarly to results from Xiao et al. (2024). We settle on 3 layers out of caution to avoid any potential capacity bottleneck.
Finally, we add a 2-layer modality projection MLP on top of the vision encoder hidden states to project the vision hidden dimension to the language model hidden dimension prior to the perceiver resampler. This addition yields better performance as well (see Table 13). A minimal sketch of such a modality projection is shown after this paragraph.
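In the sketch below, the hidden sizes and the GELU activation are assumptions for the example rather than the paper's exact configuration.

```python
# Sketch of a two-layer modality projection applied before the resampler,
# mapping the vision hidden size to the language-model hidden size.
# Hidden sizes and activation are illustrative assumptions.
import torch.nn as nn

class ModalityProjection(nn.Module):
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, x):  # (B, num_patches, vision_dim) -> (B, num_patches, lm_dim)
        return self.mlp(x)

# Usage: project the vision hidden states, then pool them into 64 visual
# tokens with the resampler sketched above, e.g.
#   visual_tokens = PerceiverResampler(dim=4096)(ModalityProjection(1152, 4096)(hidden_states))
```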
A.1.4 Ablations on OCR data
We hypothesize that adding PDF documents helps the model learn to read text from images. In Table 7, we compare checkpoints trained with and without OCR documents, with the image resolution increased to ensure that the text is legible. We do not observe statistically significant differences when evaluating the checkpoints in zero-shot or few-shot settings. Instead, we fine-tune the checkpoints on DocVQA for 500 steps with a learning rate of 1e-5, which leads to much stronger differences between the checkpoints.
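The snippet below sketches such a short fine-tuning run with the transformers Trainer. Only the 500 steps and the 1e-5 learning rate come from the ablation; the batch size, precision, and data-loading details are assumptions supplied by the caller.

```python
# Sketch of the short DocVQA fine-tuning run used to surface the OCR effect:
# 500 steps at a learning rate of 1e-5. The caller supplies the pretrained
# VLM checkpoint, a preprocessed DocVQA split, and a multimodal collator.
from transformers import Trainer, TrainingArguments

def finetune_on_docvqa(model, train_dataset, data_collator):
    args = TrainingArguments(
        output_dir="docvqa-finetune",
        max_steps=500,                  # from the ablation
        learning_rate=1e-5,             # from the ablation
        per_device_train_batch_size=8,  # assumed, not reported in the paper
        bf16=True,                      # assumed precision
        logging_steps=50,
        report_to="none",
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        data_collator=data_collator,
    )
    trainer.train()
    return trainer
```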
Authors:
(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);
(2) Léo Tronchon, Hugging Face (the order was chosen randomly);
(3) Matthieu Cord, Sorbonne Université;
(4) Victor Sanh, Hugging Face.