
LLaVA-Phi: How We Rigorously Evaluated It Using an Extensive Array of Academic Benchmarks


Too Long; Didn't Read

We rigorously evaluated LLaVA-Phi using an extensive array of academic benchmarks specifically designed for multi-modal models.

Abstract and 1 Introduction

2. Related Work

3. LLaVA-Phi and 3.1. Training

3.2. Qualitative Results

4. Experiments

5. Conclusion, Limitation, and Future Works and References

4. Experiments

We rigorously evaluated LLaVA-Phi on an extensive array of academic benchmarks designed for multi-modal models. These included general question-answering tests such as VQA-v2 [12], VizWizQA [14], ScienceQA [28], and TextVQA [31]; POPE [22], which evaluates object hallucination; and MME [10], MMBench [27], and MM-Vet [36], which provide comprehensive coverage of diverse multi-modal abilities such as visual understanding and visual commonsense reasoning.
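To make the evaluation setup concrete, the sketch below shows a generic VQA-style exact-match scoring loop of the kind these benchmarks rely on. It is a minimal illustration under stated assumptions, not the authors' evaluation code: the VQASample records and the answer_fn hook are hypothetical placeholders, and in practice answer_fn would wrap the LLaVA-Phi checkpoint while each benchmark's official protocol would govern scoring.

```python
# Minimal sketch of a VQA-style exact-match evaluation loop.
# The dataset records and the model hook (answer_fn) are placeholders;
# LLaVA-Phi itself is not loaded here.

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class VQASample:
    image_path: str      # path to the image file
    question: str        # natural-language question about the image
    answers: list[str]   # reference answers accepted as correct


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so matching is not overly strict."""
    return "".join(c for c in text.lower().strip() if c.isalnum() or c.isspace()).strip()


def exact_match_accuracy(
    samples: Iterable[VQASample],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Score a model by exact match against any reference answer."""
    total, correct = 0, 0
    for sample in samples:
        prediction = normalize(answer_fn(sample.image_path, sample.question))
        if prediction in {normalize(a) for a in sample.answers}:
            correct += 1
        total += 1
    return correct / max(total, 1)


if __name__ == "__main__":
    # Stub model for illustration only; a real run would have answer_fn call
    # the LLaVA-Phi model (vision encoder + projector + Phi-2).
    dummy_samples = [
        VQASample("img_001.jpg", "What color is the bus?", ["red"]),
        VQASample("img_002.jpg", "How many dogs are there?", ["2", "two"]),
    ]
    dummy_answer_fn = lambda image, question: "red"
    print(f"accuracy = {exact_match_accuracy(dummy_samples, dummy_answer_fn):.3f}")
```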


These benchmarks are structured to probe complex multi-modal tasks. We benchmarked LLaVA-Phi against a variety of state-of-the-art large vision-language models, as detailed in Table 1. It is important to note that both our method and LLaVA-1.5 use the same publicly available datasets for pre-training and visual instruction fine-tuning.


Our model demonstrated strong visual question-answering capability, surpassing many existing large multi-modal models. Remarkably, LLaVA-Phi outperformed models that use 7B-parameter or larger Large Language Models (LLMs) as their backbone, such as IDEFICS [17] and InstructBLIP [8]. Notably, our model achieved the best performance on ScienceQA [28]. We attribute this success to the Phi-2 language model, which was specifically trained on code-generation and mathematical corpora, enhancing our multi-modal model's prowess in math-based question-answering.


On the comprehensive multi-modal benchmark MMBench [27], LLaVA-Phi significantly outperformed many existing 7B-LLM-based vision-language models; for example, it surpassed Otter by 11.5% and InstructBLIP by 23.8%. This underscores the effectiveness of LLaVA-Phi on complex multi-modal tasks and reinforces the potential of smaller, more efficient models in the rapidly evolving field of multi-modal learning.


We also compared against MobileVLM [6], a concurrent work that also builds an efficient vision-language model. Across all five benchmarks, LLaVA-Phi consistently outperforms their method, although the margins are modest except on ScienceQA. We attribute this performance gap primarily to differences in the pre-training of the underlying language models.


This paper is available on arXiv under a CC BY 4.0 DEED license.

Authors:

(1) Yichen Zhu, Midea Group;

(2) Minjie Zhu, Midea Group and East China Normal University;

(3) Ning Liu, Midea Group;

(4) Zhicai Ou, Midea Group;

(5) Xiaofeng Mou, Midea Group.