
LLaVA-Phi: Related Work to Get You Caught Up


Too Long; Didn't Read

The rapid advancements in Large Language Models (LLMs) have significantly propelled the development of vision-language models based on LLMs.

Abstract and 1 Introduction

2. Related Work

3. LLaVA-Phi and 3.1. Training

3.2. Qualitative Results

4. Experiments

5. Conclusion, Limitation, and Future Works and References

The rapid advancements in Large Language Models (LLMs) have significantly propelled the development of vision-language models built on LLMs. These models, representing a departure from the capabilities of the pre-LLM era, are equipped with advanced question-answering and visual comprehension skills. This progress is enabled by using LLMs as language encoding modules. Notable research in this domain includes the LLaVA-family [24, 25, 26, 32], the BLIP-family [8, 20], MiniGPT-4 [37], and others, each demonstrating significant advances in managing visual-centric dialogues. However, a common limitation of these open-sourced Vision-Language Models (VLMs) is their substantial computational demand, typically ranging from 7B to 65B parameters. This requirement poses challenges for deployment on edge or mobile devices, especially in real-time applications.

Gemini [33], a leader in this field, has released three versions of vision-language models, including the compact Gemini-Nano with 1.8B/3.25B parameters, tailored for smartphones; however, its models and data are not open-sourced. Another initiative, MobileVLM [6], has developed MobileLLaMA with 2.7B parameters to enable smaller vision-language models. Our paper explores and demonstrates the effectiveness of integrating vision-language models with open-sourced, smaller language models, assessing their potential and efficiency in a variety of applications.
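To make the phrase "using LLMs as language encoding modules" concrete, the minimal sketch below wires a placeholder vision encoder to a small LLM's token-embedding space through a learned projector, in the spirit of LLaVA-style designs paired with a compact backbone. All class names, dimensions, and layer choices are illustrative assumptions, not the authors' implementation; real systems load pretrained vision and language models rather than the stand-in layers used here.

```python
# Illustrative sketch (assumed names and sizes): a generic LLaVA-style VLM
# that couples a vision encoder with a small open-source LLM via a projector.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=2560):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a CLIP ViT);
        # in practice this is loaded from a checkpoint, not a Linear layer.
        self.vision_encoder = nn.Linear(3 * 336 * 336, vision_dim)
        # Projector maps vision features into the LLM's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Stand-in for the small LLM's token embedding table (~2-3B backbone).
        self.llm_embed = nn.Embedding(32000, llm_dim)

    def forward(self, image, input_ids):
        # Encode the image, project it to "visual tokens", and prepend them
        # to the text embeddings before the LLM decoder (decoder omitted).
        img_feat = self.vision_encoder(image.flatten(1)).unsqueeze(1)
        visual_tokens = self.projector(img_feat)
        text_tokens = self.llm_embed(input_ids)
        return torch.cat([visual_tokens, text_tokens], dim=1)

# Example: one 336x336 RGB image and a short token sequence.
model = TinyVLM()
fused = model(torch.randn(1, 3, 336, 336), torch.randint(0, 32000, (1, 8)))
print(fused.shape)  # torch.Size([1, 9, 2560])
```

The design choice this sketch highlights is that only the projector (and optionally the LLM) needs to be trained to align modalities, which is what makes swapping in a smaller open-source language model an attractive path to edge-friendly VLMs.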

Figure 1. LLaVA-Phi is adept at identifying and responding to complex questions with empathetic reasoning.

Figure 2. LLaVA-Phi can generate useful code based on visual input and commands.


This paper is available on arxiv under CC BY 4.0 DEED license.

Authors:

(1) Yichen Zhu, Midea Group;

(2) Minjie Zhu, Midea Group and East China Normal University;

(3) Ning Liu, Midea Group;

(4) Zhicai Ou, Midea Group;

(5) Xiaofeng Mou, Midea Group.