3. LLaVA-Phi and 3.1. Training
5. Conclusion, Limitation, and Future Works and References
We rigorously evaluated LLaVA-Phi using an extensive array of academic benchmarks specifically designed for multi-modal models. These included tests for general question-answering such as VQA-v2 [12], VizWizQA [14], ScienceQA [28], and TextQA [31], as well as more specialized assessments like POPE [22] for evaluating object hallucination, and MME [10], MMBench [27], and MMVet [36] for a comprehensive evaluation of diverse multi-modal abilities, such as visual understanding and visual commonsense reasoning.
These benchmarks are meticulously structured to challenge and scrutinize complex multi-modal tasks. We benchmarked LLaVA-Phi against a variety of state-of-the-art, large vision-language models, as detailed in Table 1. It is important to note that both our method and LLaVA1.5 utilize the same publicly available datasets for pre-training and visual instruction fine-tuning.
Our model demonstrated a capacity for visual-based question-answering, surpassing many existing large multimodal models. Remarkably, LLaVA-Phi outperformed models that use 7B-parameter or larger Large Language Models (LLMs) as their backbone, such as IDEFICS [17] and InstructBLIP [8]. A particularly notable achievement was our model’s best performance on ScienceQA [28]. We attribute this success to the Phi-2 language model, which has been specifically trained on code generation and mathematical corpora, thereby enhancing our multi-modal model’s prowess in math-based question-answering.
In the comprehensive multi-modal benchmark of MMBench [27], LLaVA-Phi showed significantly superior performance compared to many existing 7B-LLM-based vision-language models. For example, our model outperformed Otter by 11.5% and InstructBLIP by 23.8%. This underscores the effectiveness of LLaVA-Phi in handling complex multi-modal tasks, reinforcing the potential of smaller, more efficient models in the rapidly evolving field of multi-modal models.
We also compared to MobileVLM [6], a concurrent work that builds up an efficient vision-language model. Across all five benchmarks, our LLaVA-Phi consistently outperforms their method. It’s important to note that the margins of lead are modest, with the exception of ScienceQA. We attribute this performance disparity primarily to the differences in the pretraining stages of the language models.
This paper is available on arxiv under CC BY 4.0 DEED license.
Authors:
(1) Yichen Zhu, Midea Group;
(2) Minjie Zhu, Midea Group and East China Normal University;
(3) Ning Liu, Midea Group;
(4) Zhicai Ou, Midea Group;
(5) Xiaofeng Mou, Midea Group.