Table of Links

Abstract and 1 Introduction
2 Pre-Training
2.1 Tokenization
2.2 Pre-Training Data
2.3 Stability
2.4 Inference
3 Alignment and 3.1 Data
3.2 Fine-Tuning Strategy
4 Human Evaluations and Safety Testing, and 4.1 Prompts for Evaluation
4.2 Baselines and Evaluations
4.3 Inter-annotator Agreement
4.4 Safety Testing
4.5 Discussion
5 Benchmark Evaluations and 5.1 Text
5.2 Image-To-Text
6 Related Work
7 Conclusion, Acknowledgements, Contributors, and References
Appendix
A. Samples
B. Additional Information of Human Evaluations

2 Pre-Training

Chameleon represents images, in addition to text, as a series of discrete tokens and takes advantage of the scaling properties of auto-regressive Transformers (Ramesh et al., 2021; Aghajanyan et al., 2022, 2023; Yu et al., 2023). During training we present any ordering of images and text, ranging from text-only, to single text/image pairs, to fully interleaved text-image documents.

2.1 Tokenization

Image Tokenization. We train a new image tokenizer based on Gafni et al. (2022), which encodes a 512 × 512 image into 1024 discrete tokens from a codebook of size 8192. We use only licensed images to train this tokenizer. Given the importance of generating human faces, we up-sample the proportion of images with faces by 2x during pre-training. A core weakness of our tokenizer is reconstructing images that contain a large amount of text, which upper-bounds the capability of our models on heavy OCR-related tasks.

Tokenizer. We train a new BPE tokenizer (Sennrich et al., 2016) over a subset of the training data outlined below, with a vocabulary size of 65,536 that includes the 8192 image codebook tokens, using the sentencepiece library (Kudo and Richardson, 2018).

2.2 Pre-Training Data

We delineate pre-training into two separate stages: the first stage takes up the first 80% of training, and the second stage the last 20%. For all text-image pairs we rotate the ordering so that 50% of the time the image comes before the text (i.e., captioning).

2.2.1 First Stage

In the first stage we use a data mixture consisting of the following very large-scale, completely unsupervised datasets.

Text-Only: We use a variety of textual datasets, including a combination of the pre-training data used to train LLaMa-2 (Touvron et al., 2023) and CodeLLaMa (Roziere et al., 2023), for a total of 2.9 trillion text-only tokens.

Text-Image: The text-image data for pre-training is a combination of publicly available data sources and licensed data. The images are resized and center-cropped to 512 × 512 for tokenization (a sketch of this preprocessing and tokenization follows this list). In total, we include 1.4 billion text-image pairs, which produce 1.5 trillion text-image tokens.

Text/Image Interleaved: We procure data from publicly available web sources, not including data from Meta's products or services, for a total of 400 billion tokens of interleaved text and image data, similar to Laurençon et al. (2023). We apply the same image filtering as was applied to the Text-Image data.
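To make the preprocessing above concrete, here is a minimal sketch, not code from the Chameleon release, of how a single text-image pair could be turned into one training token stream: the image is resized and center-cropped to 512 × 512 and mapped to 1024 codebook ids, the text is encoded with a SentencePiece BPE model, and the image-first (captioning) ordering is chosen 50% of the time. The `chameleon_bpe.model` path and the `image_tokenizer` callable are hypothetical stand-ins.

```python
# Minimal sketch of the interleaved tokenization described above -- not code from
# the Chameleon release. `image_tokenizer` (a VQ encoder returning 1024 codebook
# ids) and the SentencePiece model path are hypothetical stand-ins.
import random

import sentencepiece as spm
from PIL import Image

VOCAB_SIZE = 65_536        # BPE vocabulary, including the image codebook tokens
TOKENS_PER_IMAGE = 1_024   # a 512 x 512 image maps to 1024 discrete codes

sp = spm.SentencePieceProcessor(model_file="chameleon_bpe.model")  # hypothetical path


def preprocess_image(img: Image.Image, size: int = 512) -> Image.Image:
    """Resize the short side to `size`, then center-crop to `size` x `size`."""
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    left, top = (img.width - size) // 2, (img.height - size) // 2
    return img.crop((left, top, left + size, top + size))


def tokenize_pair(text: str, img: Image.Image, image_tokenizer) -> list[int]:
    """Build one training sequence from a text-image pair.

    With probability 0.5 the image precedes the text (captioning order);
    otherwise the text precedes the image.
    """
    text_tokens = sp.encode(text)                            # BPE token ids
    image_tokens = image_tokenizer(preprocess_image(img))    # 1024 codebook ids
    assert len(image_tokens) == TOKENS_PER_IMAGE
    if random.random() < 0.5:
        return image_tokens + text_tokens
    return text_tokens + image_tokens
```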
2.2.2 Second Stage

In the second stage, we lower the weight of the first-stage data by 50% and mix in higher-quality datasets, while maintaining a similar proportion of image-text tokens. We additionally include a filtered subset of the train sets from a large collection of instruction-tuning datasets.

Author:

(1) Chameleon Team, FAIR at Meta.

This paper is available on arxiv under CC BY 4.0 DEED license.
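As a closing illustration of the two-stage schedule described in Section 2.2, the sketch below (not the team's actual data loader) switches mixtures at the 80% mark of training and halves the stage-one sampling weights afterwards. The stage-one weights mirror the token counts quoted in Section 2.2.1; the `high_quality` entry and its weight are hypothetical, since the exact second-stage proportions are not given.

```python
# Minimal sketch of the two-stage sampling schedule, assuming a weighted-sampling
# data loader. The stage-one weights mirror the token counts in Section 2.2.1;
# the "high_quality" entry and its weight are hypothetical.
import random

STAGE_ONE_WEIGHTS = {       # relative token counts, in trillions
    "text_only": 2.9,
    "text_image": 1.5,
    "interleaved": 0.4,
}
STAGE_TWO_EXTRA = {"high_quality": 1.0}   # hypothetical weight for the new data


def mixture_weights(step: int, total_steps: int) -> dict[str, float]:
    """Return normalized per-dataset sampling weights for the current step."""
    if step < 0.8 * total_steps:
        # First stage: the first 80% of training uses the stage-one mixture.
        weights = dict(STAGE_ONE_WEIGHTS)
    else:
        # Second stage: halve the stage-one weights and mix in higher-quality data.
        weights = {name: 0.5 * w for name, w in STAGE_ONE_WEIGHTS.items()}
        weights.update(STAGE_TWO_EXTRA)
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_dataset(step: int, total_steps: int) -> str:
    """Pick which dataset to draw the next training document from."""
    weights = mixture_weights(step, total_steps)
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]
```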