Table of Links
3.2 How does the fully autoregressive architecture compare to the cross-attention architecture?
3.3 Where are the efficiency gains?
3.4 How can one trade compute for performance?
4.2 Instruction fine-tuning and 4.3 Optimizing for chat scenarios
5 Conclusion, Acknowledgement, and References
A Appendix
A.1 Further experimental details of the ablations
A.2 Details of the instruction fine-tuning
A.3 Details of the evaluations
A.2 Details of the instruction fine-tuning
A.2.1 Statistics of The Cauldron
In Table 14, we present the statistics of the datasets included in The Cauldron, as well as of the text-only instruction datasets used for supervised fine-tuning. For each dataset, we report the number of distinct images it contains, the number of question-answer pairs, the total number of tokens in the answers of those question-answer pairs, and the percentage of answer tokens it represents in our final mixture after upsampling or downsampling.
Table 14: The statistics of the datasets used for instruction fine-tuning. # tokens is the total number of answer tokens for each dataset (answers only). % mixture is the selected percentage of answer tokens for each dataset in the final mixture.
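As a rough illustration of how the quantities reported in Table 14 relate, below is a minimal sketch of deriving a per-dataset up/downsampling factor from its answer-token count and its selected share of the final mixture. The `DatasetStats` container, the `sampling_factors` function, and the toy numbers are illustrative assumptions, not code or values from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical container mirroring the columns reported in Table 14.
@dataclass
class DatasetStats:
    name: str
    num_images: int          # number of distinct images
    num_qa_pairs: int        # number of question-answer pairs
    num_answer_tokens: int   # total number of tokens over the answers only
    target_pct: float        # selected share of answer tokens in the final mixture (%)

def sampling_factors(datasets: List[DatasetStats]) -> Dict[str, float]:
    """Return a per-dataset factor (>1 = upsampled, <1 = downsampled) so that
    each dataset's share of answer tokens matches its selected percentage."""
    total_tokens = sum(d.num_answer_tokens for d in datasets)
    factors = {}
    for d in datasets:
        natural_pct = 100.0 * d.num_answer_tokens / total_tokens
        factors[d.name] = d.target_pct / natural_pct
    return factors

# Toy example with made-up numbers (not the real Table 14 values).
stats = [
    DatasetStats("VQAv2", 82_000, 443_000, 1_800_000, 6.0),
    DatasetStats("OCR-VQA", 165_000, 800_000, 5_200_000, 4.0),
]
print(sampling_factors(stats))
```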
Authors:
(1) Hugo Laurençon, Hugging Face and Sorbonne Université (the order was chosen randomly);
(2) Léo Tronchon, Hugging Face (the order was chosen randomly);
(3) Matthieu Cord, Sorbonne Université;
(4) Victor Sanh, Hugging Face.
This paper is available on arxiv under CC BY 4.0 DEED license.