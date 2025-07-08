

Architecture. Phi-3-Vision (4.2B parameters) is a multimodal model that takes an image and a textual prompt as input and generates text as output. The model has two primary components: an image encoder, CLIP ViT-L/14 [RKH+21], and a transformer decoder, phi-3-mini-128K-instruct. The visual tokens extracted by the image encoder are interleaved with the text tokens, with no required ordering between image and text tokens. To accommodate high-resolution images and various aspect ratios, a dynamic cropping strategy [DZZ+24b] splits the input image into a 2D array of blocks, and the tokens of the blocks are concatenated to represent the whole image.
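The block-splitting step can be illustrated with a minimal sketch. This is not the implementation from [DZZ+24b]: the block size, the zero padding at the image border, the row-major concatenation order, and the names `block_grid`, `split_into_blocks`, and `concatenate_block_tokens` are all assumptions made for illustration, and `encode` stands in for the image encoder.

```python
import math


def block_grid(width, height, block):
    """Return (rows, cols) of the block grid needed to cover the image."""
    return math.ceil(height / block), math.ceil(width / block)


def split_into_blocks(image, block):
    """Split a 2D list of pixels into a 2D array of block x block tiles.

    Pixels outside the image are padded with 0 (an assumption; the source
    does not describe the padding scheme).
    """
    height, width = len(image), len(image[0])
    rows, cols = block_grid(width, height, block)
    grid = []
    for r in range(rows):
        row_blocks = []
        for c in range(cols):
            tile = [
                [
                    image[y][x] if y < height and x < width else 0
                    for x in range(c * block, (c + 1) * block)
                ]
                for y in range(r * block, (r + 1) * block)
            ]
            row_blocks.append(tile)
        grid.append(row_blocks)
    return grid


def concatenate_block_tokens(grid, encode):
    """Encode every block and concatenate the tokens in row-major order.

    `encode` is a hypothetical stand-in for the image encoder; it maps one
    tile to a list of tokens.
    """
    tokens = []
    for row_blocks in grid:
        for tile in row_blocks:
            tokens.extend(encode(tile))
    return tokens
```

For example, a 5 × 3 image with a block size of 2 yields a 2 × 3 grid of blocks, and the image is then represented by the six per-block token lists joined end to end.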





Pre-training. Phi-3-Vision is pre-trained on a diverse dataset that combines interleaved image-text documents (e.g., [LST+24]), image-text pairs from FLD-5B [XWX+24], synthetic data derived from Optical Character Recognition (OCR) of PDF files, datasets for chart/table comprehension, and text-only data. The next-token prediction objective is applied only to text tokens; any loss associated with image tokens is ignored during this phase. Pre-training uses a total of 0.5T tokens, counting both visual and text tokens. During pre-training, the maximum image resolution is capped at 1344 × 1344, as the majority of the training images are smaller than this resolution.
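The text-only objective amounts to masking image positions out of the next-token loss. The sketch below shows one way to do this in plain Python; the function name `text_only_next_token_loss`, the input layout, and the averaging over text positions are assumptions for illustration, not details taken from the training code.

```python
import math


def text_only_next_token_loss(log_probs, targets, is_text):
    """Mean negative log-likelihood over positions whose target is a text token.

    log_probs[t][v] is the model's log-probability of vocabulary item v at
    position t, targets[t] is the index of the true next token, and
    is_text[t] is False where the target is an image token (those positions
    contribute no loss).
    """
    total, count = 0.0, 0
    for position_log_probs, target, keep in zip(log_probs, targets, is_text):
        if keep:
            total -= position_log_probs[target]
            count += 1
    # With no text positions there is nothing to average; return 0.0.
    return total / count if count else 0.0
```

In a tensor framework the same effect is commonly obtained by assigning image positions a label that the cross-entropy loss is configured to ignore.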





Post-training. Phi-3-Vision goes through two post-training stages: supervised finetuning (SFT) and direct preference optimization (DPO). For SFT, we used a text SFT dataset, public multimodal instruction-tuning datasets, and large-scale multimodal instruction-tuning datasets that we built ourselves, covering diverse domains and tasks such as general natural image understanding, chart/table/diagram understanding and reasoning, PowerPoint understanding, and model safety. The multimodal SFT data contains about 15B tokens in total. For DPO, we mainly use a text DPO dataset and a smaller multimodal DPO dataset. In both stages, we jointly train on multimodal and text-only tasks so that the model gains multimodal reasoning ability while retaining its language capabilities as much as possible.
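For reference, the standard per-example DPO loss can be sketched as follows. The text above does not state the exact objective variant or any hyperparameters, so the value `beta=0.1` and the function names are assumptions; the inputs are summed sequence log-probabilities of the preferred (chosen) and dispreferred (rejected) responses under the policy and under a frozen reference model.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the total log-probability of a response. beta is an
    assumed value that scales how strongly the policy is pushed away from
    the reference model.
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)
```

When the policy matches the reference model the margin is zero and the loss equals log 2; the loss falls below that as the policy raises the chosen response's likelihood relative to the rejected one.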





